The Data Science Venn Diagram

October 21, 2016 - 12:00 am

It is a common misconception that only geniuses or people with a PhD can understand the math and programming behind data science. This is absolutely false. In this article by Sinan Ozdemir, author of the book Principles of Data Science, we will discuss how data science begins with three basic areas:

Math/statistics: This is the use of equations and formulas to perform analysis
Computer programming: This is the ability to use code to create outcomes on the computer
Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, and so on)

The following Venn diagram provides a visual representation of how the three areas of data science intersect:

The Venn diagram of data science

Those with hacking skills can conceptualize and program complicated algorithms using computer languages. Having a math and statistics knowledge base allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations. Having substantive (domain) expertise allows you to apply concepts and results in a meaningful and effective way.

While having only two of these three qualities can make you intelligent, it will also leave a gap. Consider that you are very skilled in coding and have formal training in day trading. You might create an automated system to trade in your place but lack the math skills to evaluate your algorithms and, therefore, end up losing money in the long run. Only when you can boast skills in coding, math, and domain knowledge can you truly perform data science.

The one that was probably a surprise for you was domain knowledge. It is really just knowledge of the area you are working in. If a financial analyst started analyzing data about heart attacks, they might need the help of a cardiologist to make sense of a lot of the numbers.

Data science is the intersection of the three key areas mentioned earlier. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and above all, understand our analyses’ place in the domain we are in. This includes presentation of data. If we are creating a model to predict heart attacks in patients, is it better to create a PDF of information or an app where you can type in numbers and get a quick prediction? All these decisions must be made by the data scientist.

Also, note that the intersection of math and coding is machine learning. However, without the explicit ability to generalize any models or results to a domain, machine learning algorithms remain just algorithms sitting on your computer. You might have the best algorithm to predict cancer. You could be able to predict cancer with over 99% accuracy based on past cancer patient data, but if you don’t understand how to apply this model in a practical sense such that doctors and nurses can easily use it, your model might be useless.

Domain knowledge comes from both practicing data science and reading examples of other people’s analyses.

The math

Most people stop listening once someone says the word “math”. They’ll nod along in an attempt to hide their utter disdain for the topic. We will use several subdomains of mathematics to create what are called models.

A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon.

Essentially, we will use math in order to formalize relationships between variables. As a former pure mathematician and current math teacher, I know how difficult this can be. I will do my best to explain everything as clearly as I can. Of the three areas of data science, math is what allows us to move from domain to domain. Understanding theory allows us to apply a model that we built for the fashion industry to a problem in finance.

Every mathematical concept I introduce, I do so with care, examples, and purpose. The math in this article is essential for data scientists.

Example – Spawner-Recruit Models

In biology, we use, among many others, a model known as the Spawner-Recruit model to judge the biological health of a species. It is a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals. In a public dataset of the number of salmon spawners and recruits, the following graph was formed to visualize the relationship between the two. We can see that there definitely is some sort of positive relationship (as one goes up, so does the other). But how can we formalize this relationship? For example, if we knew the number of spawners in a population, could we predict the number of recruits that group would obtain and vice versa?

Essentially, models allow us to plug in one variable to get the other. Consider the following example:

In this example, let’s say we knew that a group of salmon had 1.15 thousand spawners. Then, we would have the following:

This result can be very beneficial to estimate how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables can change.
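
To make the idea of plugging in one variable to get the other concrete, here is a minimal sketch that assumes the relationship is a straight line. The function name and the slope and intercept values are hypothetical placeholders chosen for illustration; they are not fitted to the salmon dataset.

```python
def predict_recruits(spawners, slope, intercept):
    """Linear model: recruits = slope * spawners + intercept (both in thousands)."""
    return slope * spawners + intercept


# Hypothetical coefficients, for illustration only (not derived from real data).
SLOPE = 0.5
INTERCEPT = 60.0

# Plug in 1.15 thousand spawners to get an estimated number of recruits.
estimate = predict_recruits(1.15, SLOPE, INTERCEPT)
print("Estimated recruits (in thousands):", round(estimate, 3))
```

A different set of coefficients would describe a different population; the point is only that once a model is written down, a prediction is a single function call.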

There are many types of data models, including probabilistic and statistical models. Both of these are subsets of a larger paradigm, called machine learning. The essential idea behind these three topics is that we use data in order to come up with the “best” model possible. We no longer rely on human instincts; rather, we rely on data.

Spawner-Recruit model visualized

The purpose of this example is to show how we can define relationships between data elements using mathematical equations. The fact that I used salmon health data was irrelevant! I chose it because I would like you (the reader) to be exposed to as many domains as possible.

Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere.

Computer programming

Let’s be honest. You probably think computer science is way cooler than math. That’s OK, I don’t blame you. The news isn’t filled with math news like it is with news on the technological front. You don’t turn on the TV to see a new theory on primes; rather, you will see investigative reports on how the latest smartphone can take better photos of cats or something. Computer languages are how we communicate with the machine and tell it to do our bidding. Just as a book can be written in many languages, data science can be done in many languages. Python, Julia, and R are some of the many languages available to us. This article will focus exclusively on using Python.

Why Python?

We will use Python for a variety of reasons:

Python is an extremely simple language to read and write even if you’ve never coded before, which will make future examples easy to ingest and read later.
It is one of the most common languages in production and in the academic setting (one of the fastest growing as a matter of fact).
The online community of the language is vast and friendly. This means that a quick Google search should yield multiple results of people who have faced and solved similar (if not exact) situations.
Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize.

The last is probably the biggest reason we will focus on Python. These prebuilt modules are not only powerful but also easy to pick up. Some of these modules are as follows:

pandas
scikit-learn
seaborn
numpy/scipy
requests (to mine data from the web)
BeautifulSoup (for Web HTML parsing)

Python practices

Before we move on, it is important to formalize many of the requisite coding skills in Python.

In Python, we have variables that are placeholders for objects. We will focus on only a few types of basic objects at first:

int (an integer)
- Examples: 3, 6, 99, -34, 34, 11111111
float (a decimal):
- Examples: 3.14159, 2.71, -0.34567
boolean (either true or false)
- The statement, Sunday is a weekend, is true
- The statement, Friday is a weekend, is false
- The statement, pi is exactly the ratio of a circle’s circumference to its diameter, is true (crazy, right?)
string (text or words made up of characters)
- I love hamburgers (by the way who doesn’t?)
- Matt is awesome
- A Tweet is a string
a list (a collection of objects)
- Example: 1, 5.4, True, “apple”
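
As a quick illustration of the object types listed above, the following sketch assigns one value of each type to a variable and prints the type name Python reports. The variable names are arbitrary choices made for this example.

```python
count = 3                          # int (an integer)
pi_approx = 3.14159                # float (a decimal)
is_weekend = True                  # bool (either True or False)
message = "I love hamburgers"      # str (text made up of characters)
mixed = [1, 5.4, True, "apple"]    # list (a collection of objects)

# Print the type name of each value next to the value itself.
for value in (count, pi_approx, is_weekend, message, mixed):
    print(type(value).__name__, value)
```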

We will also have to understand some basic logical operators. For these operators, keep the boolean type in mind. Every operator will evaluate to either true or false.

== evaluates to true if both sides are equal, otherwise it evaluates to false
- 3 + 4 == 7 (will evaluate to true)
- 3 - 2 == 7 (will evaluate to false)
< (less than)
- 3 < 5 (true)
- 5 < 3 (false)
<= (less than or equal to)
- 3 <= 3 (true)
- 5 <= 3 (false)
> (greater than)
- 3 > 5 (false)
- 5 > 3 (true)
>= (greater than or equal to)
- 3 >= 3 (true)
- 3 >= 5 (false)
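
The operators above can be tried directly in Python. This short sketch pairs each expression (as text) with its evaluated result, so the printed output shows which ones are True and which are False. The list name is an arbitrary choice for this example.

```python
# Each comparison evaluates to a boolean: True or False.
comparisons = [
    ("3 + 4 == 7", 3 + 4 == 7),
    ("3 - 2 == 7", 3 - 2 == 7),
    ("3 < 5", 3 < 5),
    ("3 <= 3", 3 <= 3),
    ("5 > 3", 5 > 3),
    ("3 >= 5", 3 >= 5),
]

for label, result in comparisons:
    print(label, "->", result)
```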

When coding in Python, I will use a pound sign (#) to create a comment, which will not be processed as code but is merely there to communicate with the reader. Anything to the right of a # is a comment on the code being executed.

Example of basic Python

In Python, we use spaces/tabs to denote operations that belong to other lines of code.

Note the use of the if statement. It means exactly what you think it means. When the condition after if is true, the indented part under it will be executed, as shown in the following code:
x = 5.8
y = 9.5

x + y == 15.3  # This is True!

x - y == 15.3  # This is False!

if x + y == 15.3:    # If the statement is true:
    print("True!")   # print something!

The print line belongs to the if x + y == 15.3: line preceding it because it is indented directly under it. This means that the print statement will be executed if and only if x + y equals 15.3.
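
One caution when comparing floats with ==: most decimal values are stored as binary approximations, so an exact equality check can fail even when the arithmetic looks obviously correct. A minimal sketch of a more tolerant check, using math.isclose from the standard library (Python 3.5 or later):

```python
import math

total = 0.1 + 0.2

exact_match = (total == 0.3)             # exact comparison of two floats
close_match = math.isclose(total, 0.3)   # comparison within a small relative tolerance

print(exact_match)   # False
print(close_match)   # True
```

When a result depends on float arithmetic, comparing within a tolerance is usually the safer habit.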

Note that the following list variable, my_list, can hold multiple types of objects. This one has an int, a float, boolean, and string (in that order):

my_list = [1, 5.7, True, "apples"]

len(my_list) == 4  # 4 objects in the list

my_list[0] == 1    # the first object


my_list[1] == 5.7    # the second object

In the preceding code:

I used the len command to get the length of the list (which was four).
Note the zero-indexing of Python. Most computer languages start counting at zero instead of one. So if I want the first element, I call the index zero and if I want the 95th element, I call the index 94.
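
To see zero-indexing in action, this sketch builds a list of the integers 1 through 100, so that each element's value matches its position when counting from one. The variable names are arbitrary.

```python
numbers = list(range(1, 101))   # the integers 1 through 100

first = numbers[0]              # first element lives at index 0
ninety_fifth = numbers[94]      # 95th element lives at index 94
last = numbers[-1]              # a negative index counts from the end

print(first, ninety_fifth, last)
```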

Example – parsing a single Tweet

Here is some more Python code. In this example, I will be parsing some tweets about stock prices:

tweet = "RT @robdv: $TWTR now top holding for Andor, unseating $AAPL"

words_in_tweet = tweet.split(' ')  # list of words in tweet

for word in words_in_tweet:                 # for each word in list
    if "$" in word:                         # if word has a "cashtag"
        print("THIS TWEET IS ABOUT", word)  # alert the user

I will point out a few things about this code snippet, line by line, as follows:

We set a variable to hold some text (known as a string in Python). In this example, the tweet in question is
“RT @robdv: $TWTR now top holding for Andor, unseating $AAPL”
The words_in_tweet variable “tokenizes” the tweet (separates it by word). If you were to print this variable, you would see the following:
```
['RT',
'@robdv:',
'$TWTR',
'now',
'top',
'holding',
'for',
'Andor,',
'unseating',
'$AAPL']
```
We iterate through this list of words. This is called a for loop. It just means that we go through a list one by one.
Here, we have another if statement. For each word in this tweet, we check whether the word contains the $ character (this is how people reference stock tickers on Twitter).
If the preceding if statement is true (that is, if the word contains a cashtag), print it and show it to the user.

The output of this code will be as follows:

THIS TWEET IS ABOUT $TWTR
THIS TWEET IS ABOUT $AAPL

We get this output as these are the only words in the tweet that use the cashtag. Whenever I use Python in this article, I will ensure that I am as explicit as possible about what I am doing in each line of code.
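
The same logic can be wrapped in a small reusable function. The following is a sketch, not part of the original example: the function name find_cashtags is hypothetical, and it makes two assumptions of its own, namely that a cashtag starts with $ (instead of merely containing it) and that trailing punctuation should be stripped.

```python
def find_cashtags(text):
    """Return the words in text that start with '$', without trailing punctuation."""
    cashtags = []
    for word in text.split():                        # split on any whitespace
        if word.startswith("$"):                     # assumption: cashtags begin with $
            cashtags.append(word.rstrip(".,!?:;"))   # drop trailing punctuation
    return cashtags


tweet = "RT @robdv: $TWTR now top holding for Andor, unseating $AAPL"
print(find_cashtags(tweet))
```

Returning a list instead of printing inside the loop makes it easier to reuse the result, for example to count how often each ticker appears across many tweets.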

Domain knowledge

As I mentioned earlier, this category focuses mainly on having knowledge about the particular topic you are working on. For example, if you are a financial analyst working on stock market data, you have a lot of domain knowledge. If you are a journalist looking at worldwide adoption rates, you might benefit from consulting an expert in the field.

Does that mean that if you’re not a doctor, you can’t work with medical data? Of course not! Great data scientists can apply their skills to any area, even if they aren’t fluent in it. Data scientists can adapt to the field and contribute meaningfully when their analysis is complete.

A big part of domain knowledge is presentation. Depending on your audience, it can greatly matter how you present your findings. Your results are only as good as your vehicle of communication. You can predict the movement of the market with 99.99% accuracy, but if your program is impossible to execute, your results will go unused. Likewise, if your vehicle is inappropriate for the field, your results will go equally unused.

Some more terminology

This is a good time to define some more vocabulary. By this point, you’re probably excitedly looking up a lot of data science material and seeing words and phrases I haven’t used yet. Here are some common terms you are likely to come across:

Machine learning: This refers to giving computers the ability to learn from data without explicit “rules” being given by a programmer.
Machine learning combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and creation of powerful data models. Speaking of data models, we will concern ourselves with the following two basic types of data models:
Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness
Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula

While both the statistical and probabilistic models can be run on computers and might be considered machine learning in that regard, we will keep these definitions separate as machine learning algorithms generally attempt to learn relationships in different ways.
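
As an illustration of a statistical model expressed as a simple mathematical formula, here is a minimal sketch of fitting a straight line by ordinary least squares in plain Python. The function name fit_line and the data points are made up for this example; the points lie exactly on y = 2x + 1 so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the shared 1/n factor cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
```

With real, noisy data the points would not sit exactly on a line, and the fitted slope and intercept would be the values that minimize the total squared error.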

Exploratory data analysis: This refers to preparing data in order to standardize results and gain quick insights
Exploratory data analysis (EDA) is concerned with data visualization and preparation. This is where we turn unorganized data into organized data and also clean up missing/incorrect data points. During EDA, we will create many types of plots and use these plots in order to identify key features and relationships to exploit in our data models.
Data mining: This is the process of finding relationships between elements of data.
Data mining is the part of data science where we try to find relationships between variables (think of the Spawner-Recruit model).
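
A very small taste of the cleanup step in EDA, as a sketch using only the standard library. The readings below are made up for illustration, with None standing in for a missing data point; real projects would more likely use pandas for this.

```python
from statistics import mean, median

# Made-up measurements; None marks a missing data point.
raw_readings = [4.2, None, 5.1, 3.9, None, 4.8]

# Keep only the readings that are actually present.
clean_readings = [r for r in raw_readings if r is not None]

summary = {
    "count": len(clean_readings),
    "missing": len(raw_readings) - len(clean_readings),
    "mean": mean(clean_readings),
    "median": median(clean_readings),
}
print(summary)
```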

I tried pretty hard not to use the term big data up until now. It’s because I think this term is misused, a lot. While the definition of this term varies from person to person, big data is data that is too large to be processed by a single machine (if your laptop crashed, it might be suffering from a case of big data).

The state of data science so far (this diagram is incomplete and is meant for visualization purposes only).

Summary

More and more people are jumping headfirst into the field of data science, most with no prior experience in math or CS, which on the surface is great. Average data scientists have access to millions of dating profiles’ data, tweets, online reviews, and much more in order to jumpstart their education.

However, if you jump into data science without the proper exposure to theory or coding practices and without respect for the domain you are working in, you face the risk of oversimplifying the very phenomenon you are trying to model.
