What is Statistical Analysis and why does it matter?

As a data developer, you probably already have a clear idea of what data analysis involves. Although data analysis and statistical analysis are similar in some respects, there are also important differences between them that you need to understand.

This article is taken from the book Statistics for Data Science by James D. Miller. This book takes you through an entire journey of statistics, from knowing very little to becoming comfortable in using various statistical methods for data science tasks.

In this article, we've broken things into the following topics:

What is statistical analysis, and what are its best practices?

How to establish the nature of data?

What is statistical analysis?

Statistical analysis is sometimes described as the part of a statistical project that involves collecting and scrutinizing a data source in an effort to identify trends within the data.

With data analysis, the goal is to validate that the data is appropriate for a need, and with statistical analysis, the goal is to make sense of, and draw some inferences from, the data.

There is a wide range of possible statistical analysis techniques or approaches that can be considered.

How to perform a successful statistical analysis

A few key points are worth mentioning, as they help to ensure a successful (or at least productive) statistical analysis effort.

As soon as you can, decide on your goal or objective. You need to know what the win is, that is, what problem or idea is driving the analysis effort. In addition, whatever is driving the analysis, you need to make sure that the result can be measured in some way. This metric or performance indicator must be identified early.

Identify key levers. This means that once you have established your goals and a way to measure performance towards obtaining those goals, you also need to find out what has an effect on the performance towards obtaining each goal.

Conduct a thorough data collection. Typically, the more data the better, but in the absence of quantity, always go with quality.

Clean your data. Make sure your data has been cleaned in a consistent way so that data issues do not affect your conclusions.

Model, model, and model your data. Modeling drives modeling. The more you model your data, the more questions you'll have asked and answered, and the better results you'll have.

Take time to grow your statistical analysis skills. It's always a good idea to keep developing your experience and style of statistical analysis. The way to improve is to practice. Another approach is to remodel data you have on hand from other projects to hone your skills.

Optimize and repeat. As always, you need to take the time for standardizing, following proven practices, using templates, and testing and documenting your scripts and models, so that you can re-use your best efforts over and over again. You will find that this time will be well spent and even your better efforts will improve with use. Finally, share your work with others! The more eyes, the better the product.
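As one illustration of the "clean your data" point above, the following is a minimal sketch of applying the same cleaning rules to every record. The field names (sale_type, quantity), the sample records, and the clean_record helper are all hypothetical and chosen only for illustration; real cleaning rules depend on your data and objectives.

```python
def clean_record(record):
    """Return a cleaned copy of a record, or None if it cannot be used.

    Assumed (hypothetical) rules: sale_type is trimmed and lower-cased,
    quantity must convert to a non-negative integer.
    """
    sale_type = str(record.get("sale_type", "")).strip().lower()
    try:
        quantity = int(record.get("quantity"))
    except (TypeError, ValueError):
        # Missing or non-numeric quantities are dropped, not guessed.
        return None
    if not sale_type or quantity < 0:
        return None
    return {"sale_type": sale_type, "quantity": quantity}


# Hypothetical raw input with inconsistent formatting and missing values.
raw_records = [
    {"sale_type": " Online ", "quantity": "12"},
    {"sale_type": "RETAIL", "quantity": 7},
    {"sale_type": "online", "quantity": None},
    {"sale_type": "", "quantity": 3},
]

cleaned = [rec for rec in map(clean_record, raw_records) if rec is not None]
print(cleaned)
```

Keeping the rules in one function means every record passes through the same checks, which is what makes the cleaning consistent and easy to re-use and document.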

Some interesting advice on ensuring success with statistical projects includes the following quote:

It's a good idea to build a team that allows those with an advanced degree in statistics to focus on data modeling and predictions, while others in the team (qualified infrastructure engineers, software developers and ETL experts) build the necessary data collection infrastructure, data pipeline and data products that enable streaming the data through the models and displaying the results to the business in the form of reports and dashboards.
- G Shapira, 2017

Establishing the nature of data

When asked about the objectives of statistical analysis, one often refers to the process of describing or establishing the nature of a data source.

Establishing the nature of something implies gaining an understanding of it. This understanding can be simple or complex. For example, can we determine the type of each of the variables or components found within our data source? Are they quantitative, comparative, or qualitative?
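A first pass at this question can be sketched in a few lines. The example below is a simplified illustration, not a method from the book: the records and column names are made up, and the hypothetical infer_variable_type helper only separates quantitative (numeric) columns from qualitative (non-numeric) ones. It does not attempt to detect comparative variables.

```python
from numbers import Number

# Hypothetical sample records; names and values are for illustration only.
records = [
    {"sale_type": "online", "quantity": 12, "region": "East"},
    {"sale_type": "retail", "quantity": 7, "region": "West"},
    {"sale_type": "online", "quantity": 20, "region": "West"},
]


def infer_variable_type(values):
    """Return 'quantitative' if every value is numeric, else 'qualitative'."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if numeric else "qualitative"


def profile_columns(rows):
    """Map each column name to its inferred variable type."""
    return {
        name: infer_variable_type([row[name] for row in rows])
        for name in rows[0]
    }


print(profile_columns(records))
```

In practice a numeric column can still be qualitative (a numeric region code, for example), so a result like this is a starting point for questions rather than a final answer.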

A more advanced statistical analysis aims to identify patterns in data; for example, whether there is a relationship between the variables or whether certain groups are more likely to show certain attributes than others.

Exploring the relationships presented in data may appear to be similar to the idea of identifying a foreign key in a relational database, but in statistics, relationships between the components or variables are based upon correlation and causation.
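To make the idea of a correlation-based relationship concrete, here is a minimal sketch that computes the Pearson correlation coefficient between two variables. The variable names (ad_spend, units_sold) and their values are invented for illustration, and Pearson's coefficient is only one of several possible measures.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products and squares of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observations of two variables.
ad_spend = [1, 2, 3, 4, 5]
units_sold = [2, 4, 5, 4, 5]

r = pearson_correlation(ad_spend, units_sold)
print(round(r, 3))
```

A coefficient near 1 or -1 indicates a strong linear relationship, but it does not by itself show that one variable causes the other.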

Further, establishing the nature of a data source is really a process of modeling that data source. Modeling always involves asking questions such as the following (in an effort to establish the nature of the data):

What? Some common examples of this (what) are revenue, expenses, shipments, hospital visits, website clicks, and so on. In the example, we are measuring quantities, that is, the amount of product that is being moved (sales).

Why? This (why) will typically depend upon your project's specific objectives, which can vary immensely. For example, we may want to track the growth of a business, the activity on a website, or the evolution of a selected product or market interest. Again, in our current transactional data example, we may want to identify over- and under-performing sales types, and determine whether new or repeat customers provide more or fewer sales.

How? The how will most likely be over a period of time (perhaps a year, month, week, and so on) and then by some other related measure, such as a product, state, region, reseller, and so on. Within our transactional data example, we've focused on the observation of quantities by sale type.
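The what, why, and how questions above can be sketched as a simple aggregation: the what is a quantity, and the how is a grouping by period or by sale type. The transactions, the tuple layout, and the quantity_by helper below are hypothetical and stand in for whatever transactional data you are working with.

```python
from collections import defaultdict

# Hypothetical transactions: (month, sale_type, quantity).
transactions = [
    ("2017-01", "online", 12),
    ("2017-01", "retail", 7),
    ("2017-02", "online", 20),
    ("2017-02", "retail", 5),
    ("2017-02", "online", 3),
]


def quantity_by(rows, key_index):
    """Total the quantity (column 2) grouped by the column at key_index."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


by_sale_type = quantity_by(transactions, 1)  # how: by sale type
by_month = quantity_by(transactions, 0)      # how: over a period of time

print(by_sale_type)
print(by_month)
```

Comparing the totals per sale type is one simple way to start spotting over- and under-performing sale types.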

Another way to describe establishing the nature of your data is adding context to it or profiling it. In any case, the objective is to allow the data consumer to better understand the data through visualization.

Another motive for adding context or establishing the nature of your data can be to gain a new perspective on the data.

In this article, we explored the purpose and process of statistical analysis and listed the steps involved in a successful statistical analysis.

Next, to learn about statistical regression and why it is important to data science, read our book Statistics for Data Science.
