It is a common misconception that only geniuses or those with a PhD can understand the math and programming behind data science. This is absolutely false. In this article by Sinan Ozdemir, author of the book Principles of Data Science, we will discuss how data science begins with three basic areas: math and statistics, computer programming, and domain knowledge.
The following Venn diagram provides a visual representation of how the three areas of data science intersect:
Those with hacking skills can conceptualize and program complicated algorithms using computer languages. Having a math and statistics knowledge base allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations. Having substantive (domain) expertise allows you to apply concepts and results in a meaningful and effective way.
While having only two of these three qualities can make you intelligent, it will also leave a gap. Suppose you are very skilled in coding and have formal training in day trading. You might create an automated system to trade in your place but lack the math skills to evaluate your algorithms and, therefore, end up losing money in the long run. Only when you have skills in coding, math, and domain knowledge can you truly perform data science.
The one that was probably a surprise for you was domain knowledge. It is really just knowledge of the area you are working in. If a financial analyst started analyzing data about heart attacks, they might need the help of a cardiologist to make sense of a lot of the numbers.
Data science is the intersection of the three key areas mentioned earlier. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and above all, understand our analyses’ place in the domain we are in. This includes presentation of data. If we are creating a model to predict heart attacks in patients, is it better to create a PDF of information or an app where you can type in numbers and get a quick prediction? All these decisions must be made by the data scientist.
Also, note that the intersection of math and coding is machine learning. Without the explicit ability to generalize models or results to a domain, however, machine learning algorithms remain just algorithms sitting on your computer. You might have the best algorithm to predict cancer, one that predicts cancer with over 99% accuracy based on past cancer patient data, but if you don’t understand how to apply this model in a practical sense such that doctors and nurses can easily use it, your model might be useless.
Domain knowledge comes both from practicing data science and from reading examples of other people’s analyses.
Most people stop listening once someone says the word “math”. They’ll nod along in an attempt to hide their utter disdain for the topic. We will use these subdomains of mathematics to create what are called models.
A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon.
Essentially, we will use math in order to formalize relationships between variables. As a former pure mathematician and current math teacher, I know how difficult this can be. I will do my best to explain everything as clearly as I can. Among the three areas of data science, math is what allows us to move from domain to domain. Understanding theory allows us to apply a model that we built for the fashion industry to the financial industry.
Every mathematical concept I introduce, I do so with care, examples, and purpose. The math in this article is essential for data scientists.
In biology, we use, among many others, a model known as the Spawner-Recruit model to judge the biological health of a species. It is a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals. In a public dataset of the number of salmon spawners and recruits, the following graph was formed to visualize the relationship between the two. We can see that there definitely is some sort of positive relationship (as one goes up, so does the other). But how can we formalize this relationship? For example, if we knew the number of spawners in a population, could we predict the number of recruits that group would obtain and vice versa?
Essentially, models allow us to plug in one variable to get the other. Consider the following example:
In this example, let’s say we knew that a group of salmon had 1.15 thousand spawners. Then, we would have the following:
This result can be very beneficial to estimate how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables can change.
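As a minimal sketch of what formalizing such a relationship can look like, the following code fits a straight line through a handful of points by ordinary least squares and then plugs in 1.15 thousand spawners. The data values and the names fit_line and predict are hypothetical, chosen only for illustration; they are not the salmon dataset, and a straight line is an assumed model form rather than the Spawner-Recruit model itself.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Plug a value of x into the fitted line."""
    return slope * x + intercept


# Hypothetical (made-up) data, both in thousands; NOT the real salmon dataset.
spawners = [0.5, 1.0, 1.5, 2.0, 2.5]
recruits = [1.2, 1.9, 3.1, 3.9, 5.1]

slope, intercept = fit_line(spawners, recruits)
estimate = predict(slope, intercept, 1.15)
print("Estimated recruits (in thousands):", round(estimate, 2))
```

Once the slope and intercept are known, any number of spawners can be plugged in to estimate recruits, which is what it means to formalize the relationship.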
There are many types of data models, including probabilistic and statistical models. Both of these are subsets of a larger paradigm, called machine learning. The essential idea behind these three topics is that we use data in order to come up with the “best” model possible. We no longer rely on human instincts; rather, we rely on data.
The purpose of this example is to show how we can define relationships between data elements using mathematical equations. The fact that I used salmon health data was irrelevant! The main reason for this is that I would like you (the reader) to be exposed to as many domains as possible.
Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere.
Let’s be honest. You probably think computer science is way cooler than math. That’s OK, I don’t blame you. The news isn’t filled with math news like it is with news on the technological front. You don’t turn on the TV to see a new theory on primes; rather, you will see investigative reports on how the latest smartphone can take better photos of cats or something. Computer languages are how we communicate with the machine and tell it to do our bidding. A computer speaks many languages and, just as a book can be written in many languages, data science can be done in many languages. Python, Julia, and R are some of the many languages available to us. This article will focus exclusively on using Python.
We will use Python for a variety of reasons:
The last is probably the biggest reason we will focus on Python. These prebuilt modules are not only powerful but also easy to pick up. Some of these modules are as follows:
Before we move on, it is important to formalize many of the requisite coding skills in Python.
In Python, we have variables that are placeholders for objects. We will focus on only a few types of basic objects at first:
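As a small sketch, the following code assigns one object of each basic type used in this article's examples (int, float, boolean, string, and list) to a variable. The variable names are hypothetical and chosen only for illustration.

```python
# Variable names here are illustrative only.
number_of_tweets = 3        # int: a whole number
average_price = 5.7         # float: a decimal number
is_about_stocks = True      # boolean: either True or False
fruit = "apples"            # string: text made up of characters

# list: a collection of objects, possibly of different types
mixed = [number_of_tweets, average_price, is_about_stocks, fruit]

# type() reports what kind of object a variable currently holds
for item in mixed:
    print(item, type(item).__name__)
```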
We will also have to understand some basic logical operators. For these operators, keep the boolean type in mind. Every operator will evaluate to either True or False.
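A minimal sketch of a few comparison operators follows. Which operators the author had in mind is an assumption; the ones below are built into Python, and each expression evaluates to either True or False. The names a, b, and results are hypothetical.

```python
a = 5
b = 9

# Each comparison produces a boolean value
results = {
    "a == b": a == b,   # equal to
    "a != b": a != b,   # not equal to
    "a < b": a < b,     # less than
    "a <= b": a <= b,   # less than or equal to
    "a > b": a > b,     # greater than
    "a >= b": a >= b,   # greater than or equal to
}

for expression, value in results.items():
    print(expression, "is", value)
```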
When coding in Python, I will use a pound sign (#) to create a comment, which will not be processed as code but is merely there to communicate with the reader. Anything to the right of a # is a comment on the code being executed.
In Python, we use spaces/tabs to denote operations that belong to other lines of code.
Note the use of the if statement. It means exactly what you think it means. When the condition after if is true, the indented part under it will be executed, as shown in the following code:
x = 5.8
y = 9.5
x + y == 15.3  # This is True!
x - y == 15.3  # This is False!
if x + y == 15.3:  # If the statement is true:
    print("True!")  # print something!
The print line belongs to the if x + y == 15.3: line preceding it because it is indented right under it. This means that it will be executed if and only if x + y equals 15.3.
Note that the following list variable, my_list, can hold multiple types of objects. This one has an int, a float, a boolean, and a string (in that order):
my_list = [1, 5.7, True, "apples"]
len(my_list) == 4 # 4 objects in the list
my_list[0] == 1 # the first object
my_list[1] == 5.7 # the second object
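In the preceding code, len gives the number of objects in the list, and indexing starts at zero, so my_list[0] is the first object and my_list[1] is the second.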
Here is some more Python code. In this example, I will be parsing some tweets about stock prices:
tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"
words_in_tweet = tweet.split(' ')  # list of words in tweet
for word in words_in_tweet:  # for each word in list
    if "$" in word:  # if word has a "cashtag"
        print("THIS TWEET IS ABOUT", word)  # alert the user
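I will point out a few things about this code snippet, line by line, as follows:
First, the tweet variable holds the text of the tweet as a string:
"RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"
The words_in_tweet variable holds the list of words obtained by splitting the tweet on spaces. If you were to print it, you would see the following:
['RT', '@j_o_n_dnger:', '$TWTR', 'now', 'top', 'holding', 'for', 'Andor,', 'unseating', '$AAPL']
The for loop then goes through this list one word at a time, and the if statement checks whether the word contains a $ (a cashtag). If it does, the word is printed.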
The output of this code will be as follows:
THIS TWEET IS ABOUT $TWTR
THIS TWEET IS ABOUT $AAPL
We get this output as these are the only words in the tweet that use the cashtag. Whenever I use Python in this article, I will ensure that I am as explicit as possible about what I am doing in each line of code.
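The same logic can be wrapped in a reusable function. This is only a sketch: the name find_cashtags is hypothetical, and it treats any space-separated word containing a $ as a cashtag, exactly as the loop does, so trailing punctuation is not stripped.

```python
def find_cashtags(tweet):
    """Return the words in a tweet that contain a "$" (a cashtag)."""
    words_in_tweet = tweet.split(' ')  # list of words in tweet
    return [word for word in words_in_tweet if "$" in word]


tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"
for cashtag in find_cashtags(tweet):
    print("THIS TWEET IS ABOUT", cashtag)
```

Returning a list instead of printing inside the loop lets the caller decide what to do with the cashtags, for example counting them across many tweets.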
As I mentioned earlier, this category focuses mainly on having knowledge about the particular topic you are working on. For example, if you are a financial analyst working on stock market data, you have a lot of domain knowledge. If you are a journalist looking at worldwide adoption rates, you might benefit from consulting an expert in the field.
Does that mean that if you’re not a doctor, you can’t work with medical data? Of course not! Great data scientists can apply their skills to any area, even if they aren’t fluent in it. Data scientists can adapt to the field and contribute meaningfully when their analysis is complete.
A big part of domain knowledge is presentation. Depending on your audience, it can greatly matter how you present your findings. Your results are only as good as your vehicle of communication. You can predict the movement of the market with 99.99% accuracy, but if your program is impossible to execute, your results will go unused. Likewise, if your vehicle is inappropriate for the field, your results will go equally unused.
This is a good time to define some more vocabulary. By this point, you’re probably excitedly looking up a lot of data science material and seeing words and phrases I haven’t used yet. Here are some common terms you are likely to come across:
Machine learning combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and the creation of powerful data models. Speaking of data models, we will concern ourselves with the following two basic types of data models: probabilistic models and statistical models.
While both the statistical and probabilistic models can be run on computers and might be considered machine learning in that regard, we will keep these definitions separate as machine learning algorithms generally attempt to learn relationships in different ways.
Exploratory data analysis (EDA) is concerned with data visualization and preparation. This is where we turn unorganized data into organized data and also clean up missing/incorrect data points. During EDA, we will create many types of plots and use these plots in order to identify key features and relationships to exploit in our data models.
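As a rough sketch of the cleaning step, the code below drops records with missing or clearly incorrect values before summarizing what remains. The records, the field names, and the rule that an age must be between 0 and 120 are all hypothetical assumptions made for illustration.

```python
# Hypothetical, made-up records; None marks a missing value.
records = [
    {"name": "A", "age": 34},
    {"name": "B", "age": None},   # missing data point
    {"name": "C", "age": -5},     # incorrect data point
    {"name": "D", "age": 51},
]


def is_valid(record):
    """Keep a record only if its age is present and plausible (assumed range)."""
    age = record["age"]
    return age is not None and 0 <= age <= 120


clean_records = [record for record in records if is_valid(record)]
ages = [record["age"] for record in clean_records]

print("Kept", len(clean_records), "of", len(records), "records")
print("Mean age of kept records:", sum(ages) / len(ages))
```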
Data mining is the part of data science where we try to find relationships between variables (think of the Spawner-Recruit model).
I tried pretty hard not to use the term big data up until now, because I think this term is misused a lot. While the definition of this term varies from person to person, big data is data that is too large to be processed by a single machine (if your laptop crashed, it might be suffering from a case of big data).
More and more people are jumping headfirst into the field of data science, most with no prior experience in math or CS, which on the surface is great. Average data scientists have access to millions of dating profiles’ data, tweets, online reviews, and much more in order to jumpstart their education.
However, if you jump into data science without the proper exposure to theory or coding practices and without respect for the domain you are working in, you face the risk of oversimplifying the very phenomenon you are trying to model.