[box type=”note” align=”” class=”” width=””]The following article is an excerpt taken from the book Statistics for Data Science, authored by James D. Miller. The book presents interesting techniques through which you can leverage the power of statistics for data manipulation and analysis.[/box]
In this article, we put the spotlight on data structures and data models, and look at the difference between the two.
Data developers will agree that whenever one is working with large amounts of data, the organization of that data is imperative. If the data is not organized effectively, it will be very difficult to perform any task on it, or at least to perform the task efficiently. If the data is organized effectively, most operations on it become far easier.
A data or database developer will then organize the data into what are known as data structures. The following image shows a simple binary tree, where the data is organized efficiently by structuring it:
A data structure can be defined as a method of organizing large amounts of data more efficiently so that any operation on that data becomes easy.
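To make the idea concrete, here is a minimal sketch of one such structure. It assumes a binary search tree, which is one common kind of binary tree; the `Node`, `insert`, `contains`, and `in_order` names and the sample keys are made up for illustration and are not taken from the book.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Node:
    """One node of a binary search tree (hypothetical illustration)."""
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int) -> Node:
    """Insert key, keeping smaller keys on the left and larger on the right."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # a key that is already present is ignored


def contains(root: Optional[Node], key: int) -> bool:
    """Search by discarding one subtree at every step."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False


def in_order(root: Optional[Node]) -> Iterator[int]:
    """Yield the keys in ascending order."""
    if root is not None:
        yield from in_order(root.left)
        yield root.key
        yield from in_order(root.right)


def build_tree(keys: Iterable[int]) -> Optional[Node]:
    """Build a tree by inserting the keys one at a time."""
    root = None
    for key in keys:
        root = insert(root, key)
    return root


# Made-up sample keys.
tree = build_tree([50, 30, 70, 20, 40, 60, 80])
```

Because the keys are kept in order, a lookup only has to follow one branch at each level instead of scanning every value, which is the kind of efficiency the definition above refers to.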
Data structures are created in such a way as to implement one or more particular abstract data types (ADTs), which in turn stipulate what operations can be performed on the data structure, as well as the computational complexity of those operations.
[box type=”info” align=”” class=”” width=””]In computer science, an ADT is a model for data types where a data type is defined by its behavior from the point of view (POV) of users of that data, explicitly showing the possible values, the possible operations on data of this type, and the behavior of all of these operations.[/box]
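The split between an ADT and the data structure that implements it can be sketched as follows. This is only an illustration under assumed names: the `Stack` interface and the `ListStack` class are hypothetical, and a stack is just one convenient example of an ADT.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class Stack(ABC):
    """Abstract data type: says what a stack does, not how it is stored."""

    @abstractmethod
    def push(self, item: Any) -> None: ...

    @abstractmethod
    def pop(self) -> Any: ...

    @abstractmethod
    def is_empty(self) -> bool: ...


class ListStack(Stack):
    """One possible data structure that implements the Stack ADT."""

    def __init__(self) -> None:
        self._items: List[Any] = []

    def push(self, item: Any) -> None:
        self._items.append(item)  # amortized constant time

    def pop(self) -> Any:
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()  # constant time, removes the last item pushed

    def is_empty(self) -> bool:
        return not self._items
```

Code written against `Stack` only relies on the behavior of `push`, `pop`, and `is_empty`, so the list-backed storage could be swapped for another structure without changing the callers.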
Database design is then the process of using the defined data structures to produce a detailed data model, which will become the database. This data model must contain all of the required logical and physical design selections, as well as the physical storage parameters needed to produce a design in a Data Definition Language (DDL), which can then be used to create an actual database.
[box type=”info” align=”” class=”” width=””]There are varying degrees of the data model, for example, a fully attributed data model would also contain detailed attributes for each entity in the model.[/box]
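As a rough sketch of what a design expressed in DDL might look like, the snippet below runs two `CREATE TABLE` statements against an in-memory SQLite database using Python's standard `sqlite3` module. The `customer` and `customer_order` tables and their columns are assumptions made for illustration; a real design would also carry the physical storage parameters mentioned above, which SQLite does not expose.

```python
import sqlite3

# Hypothetical schema: table and column names are made up for illustration.
DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    salesperson TEXT NOT NULL,
    amount      REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")  # throwaway database, nothing is written to disk
conn.executescript(DDL)             # run the DDL to create the actual tables

# Ask the database which tables now exist.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```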
So, is a data structure a data model?
No, a data structure is used to create a data model. Is this data model the same as data models used in statistics? Let’s see in the next section.
You will find that statistical data models are at the heart of statistical analytics. In the simplest terms, a statistical data model is defined as the following:
A representation of a state, process, or system that we want to understand and reason about
In the scope of the previous definition, the data or database developer might agree that in theory or in concept, one could use the same terms to define a financial reporting database, as it is designed to contain business transactions and is arranged in data structures that allow business analysts to efficiently review the data, so that they can understand or reason about particular interests they may have concerning the business.
Data scientists develop statistical data models so that they can draw inferences from them and, more importantly, make predictions about a topic of concern. Data developers develop databases so that they can similarly draw inferences from them and, more importantly, make predictions about a topic of concern, although in some organizations databases are focused more on past and current events (transactions) than on forward-looking ones (predictions).
Statistical data models come in a multitude of different formats and flavours (as do databases). These models can be equations linking quantities that we can observe or measure; they can also simply be sets of rules.
Databases can be designed or formatted to simplify the entering of online transactions—say, in an order entry system—or for financial reporting when the accounting department must generate a balance sheet, income statement, or profit and loss statement for shareholders.
[box type=”info” align=”” class=”” width=””]I found this example of a simple statistical data model: Newton’s Second Law of Motion, which states that the net force acting on an object causes the object to accelerate in the direction of that force, at a rate proportional to the magnitude of the force and inversely proportional to the object’s mass.[/box]
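Read as an equation linking measurable quantities, the law says that acceleration equals net force divided by mass. A minimal sketch, in which the function name and the SI units are assumptions chosen for illustration:

```python
def acceleration(net_force_newtons: float, mass_kg: float) -> float:
    """Return acceleration in m/s^2 from Newton's second law, a = F / m."""
    if mass_kg <= 0:
        raise ValueError("mass must be positive")
    return net_force_newtons / mass_kg


# Doubling the force doubles the acceleration; doubling the mass halves it.
base = acceleration(10.0, 2.0)
double_force = acceleration(20.0, 2.0)
double_mass = acceleration(10.0, 4.0)
```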
What’s the difference?
Where or how does the reader find the difference between a data structure or database and a statistical model? At a high level, as we speculated in previous sections, one can conclude that a data structure/database is practically the same thing as a statistical data model, as shown in the following image:
When you take the time to drill deeper into the topic, however, you should consider the following key points:
- Although both the data structure/database and the statistical model could be said to represent a set of assumptions, the statistical model typically will be found to be much more keenly focused on a particular set of assumptions concerning the generation of some sample data, and similar data from a larger population, while the data structure/database more often than not will be more broadly based
- A statistical model is often in a rather idealized form, while the data structure/database may be less perfect in the pursuit of a specific assumption
- Both a data structure/database and a statistical model are built around relationships between variables
- The data structure/database relationship may focus on answering certain questions, such as:
  - What are the total orders for specific customers?
  - What are the total orders for a specific customer who has purchased from a certain salesperson?
  - Which customer has placed the most orders?
- Statistical model relationships are usually very simple, and focused on testing certain claims, such as:
  - Females are shorter than males by a fixed amount
  - Body mass is proportional to height
  - The probability that any given person will partake in a certain sport is a function of age, sex, and socioeconomic status
- Data structures/databases are all about the act of summarizing data based on relationships between variables
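The database-style questions in the list above map naturally onto SQL aggregation. The sketch below uses Python's standard `sqlite3` module with a made-up `orders` table; the table layout, customer names, and amounts are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, salesperson TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("Acme", "Lee", 100.0),   # made-up sample rows
        ("Acme", "Kim", 250.0),
        ("Acme", "Lee", 50.0),
        ("Birch", "Lee", 75.0),
    ],
)

# Total orders (count and value) for each customer.
totals = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

# Total orders for a specific customer who has purchased from a certain salesperson.
acme_lee = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM orders "
    "WHERE customer = ? AND salesperson = ?",
    ("Acme", "Lee"),
).fetchone()

# The customer who has placed the most orders.
top_customer = conn.execute(
    "SELECT customer FROM orders GROUP BY customer "
    "ORDER BY COUNT(*) DESC LIMIT 1"
).fetchone()[0]
```

Each query summarizes the stored rows along a relationship between columns (customer, salesperson, amount); none of them says anything about orders that were never recorded.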
The relationships between variables in a statistical model may turn out to be much more complicated than they first appear, and are not always straightforward to recognize and understand. An illustration of this is the use of effect statistics. An effect statistic is one that shows a difference in the value of one variable that is associated with a difference in one or more other variables.
Can you imagine the SQL query statements you’d use to establish a relationship between two database variables based upon one or more effect statistics? On this point, you may find that a data structure/database usually aims to characterize relationships between variables, while with statistical models, the data scientist looks to fit the model to prove a point or make a statement about the population in the model. That is, a data scientist endeavors to make a statement about the accuracy of an estimate of the effect statistic(s) describing the model!
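As a small illustration of an effect statistic and the accuracy of its estimate, the sketch below computes a difference in mean height between two groups together with a standard error. The height values are invented for the example, and the unpooled (Welch-style) standard error is one common choice assumed here, not a method taken from the book.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple

# Hypothetical sample heights in centimetres (made-up numbers).
male_heights = [178.0, 182.0, 175.0, 180.0, 185.0]
female_heights = [165.0, 170.0, 162.0, 168.0, 160.0]


def mean_difference(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return the difference in sample means and its standard error."""
    effect = mean(a) - mean(b)
    # Unpooled standard error: each sample contributes its own variance.
    std_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return effect, std_error


effect, std_error = mean_difference(male_heights, female_heights)
print(f"estimated difference: {effect:.1f} cm (standard error {std_error:.1f} cm)")
```

The effect is the summary of the relationship between sex and height, and the standard error is a statement about how accurate that estimate is likely to be for the wider population, which is the part a plain aggregate query does not provide.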
One more note of interest is that both a data structure/database and a statistical model can be seen as tools or vehicles that aim to generalize a population; a database uses SQL to aggregate or summarize data, and a statistical model summarizes its data using effect statistics.
The above argument presented the notion that data structures/databases and statistical data models are, in many ways, very similar.
If you found this excerpt to be useful, check out the book Statistics for Data Science, which demonstrates different statistical techniques for implementing various data science tasks such as pre-processing, mining, and analysis.