This article written by Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, and Abhijit Dasgupta, authors of the book Practical Data Science Cookbook, will cover the following topics:
The dataset, available at http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip, contains fuel efficiency performance metrics over time for all makes and models of automobiles in the United States of America. This dataset also contains numerous other features and attributes of the automobile models other than fuel economy, providing an opportunity to summarize and group the data so that we can identify interesting trends and relationships.
We will perform the entire analysis using Python. However, we will ask the same questions and follow the same sequence of steps as before, again following the data science pipeline. This will allow you to see the similarities and differences between the two languages for a mostly identical analysis.
In this article, we will take a very different approach using Python as a scripting language in an interactive fashion that is more similar to R. We will introduce the reader to the unofficial interactive environment of Python, IPython, and the IPython notebook, showing how to produce readable and well-documented analysis scripts. Further, we will leverage the data analysis capabilities of the relatively new but powerful pandas library and the invaluable data frame data type that it offers. pandas often allows us to complete complex tasks with fewer lines of code. The drawback to this approach is that while you don’t have to reinvent the wheel for common data manipulation tasks, you do have to learn the API of a completely different package, which is pandas.
The goal of this article is not to guide you through an analysis project that you have already completed but to show you how that project can be completed in another language. More importantly, we want to get you, the reader, to become more introspective with your own code and analysis. Think not only about how something is done but why something is done that way in that particular language. How does the language shape the analysis?
IPython is the interactive computing shell for Python that will change the way you think about interactive shells. It brings to the table a host of very useful functionalities that will most likely become part of your default toolbox, including magic functions, tab completion, easy access to command-line tools, and much more. We will only scratch the surface here and strongly recommend that you keep exploring what can be done with IPython.
If you have completed the installation, you should be ready to tackle the following recipes. Note that IPython 2.0, which is a major release, was launched in 2014.
The following steps will get you up and running with the IPython environment:
Python 2.7.5 (default, Mar 9 2014, 22:15:05)
Type "copyright", "credits" or "license" for more information. IPython 2.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]:
Note that your version might be slightly different than what is shown in the preceding command-line output.
n = 100000
%timeit range(n)
%timeit xrange(n)
We should get an output like this:
1000 loops, best of 3: 1.22 ms per loop
1000000 loops, best of 3: 258 ns per loop
This shows you how much faster xrange is than range (1.22 milliseconds versus 258 nanoseconds!) and helps demonstrate the utility of generators in Python.
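As a side note, in Python 3 the old eager range was removed and range itself became a lazy sequence, much like Python 2's xrange. The following minimal sketch (Python 3) illustrates the same memory argument that makes xrange attractive:

```python
import sys

n = 100000

# In Python 3, range() is a lazy sequence object (like Python 2's xrange),
# so its memory footprint does not grow with n.
lazy = range(n)

# list() materializes all n integers at once.
eager = list(range(n))

print(sys.getsizeof(lazy))   # small, constant-size object
print(sys.getsizeof(eager))  # grows with n
```

The lazy object still supports len, indexing, and membership tests, so in most code it is a drop-in replacement for the materialized list.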
!ping www.google.com
You should see the following output:
PING google.com (74.125.22.101): 56 data bytes
64 bytes from 74.125.22.101: icmp_seq=0 ttl=38 time=40.733 ms
64 bytes from 74.125.22.101: icmp_seq=1 ttl=38 time=40.183 ms
64 bytes from 74.125.22.101: icmp_seq=2 ttl=38 time=37.635 ms
%history 1
There isn’t much to explain here and we have just scratched the surface of what IPython can do. Hopefully, we have gotten you interested in diving deeper, especially with the wealth of new features offered by IPython 2.0, including dynamic and user-controllable data visualizations.
IPython Notebook is the perfect complement to IPython. As per the IPython website:
“The IPython Notebook is a web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document.”
While this is a bit of a mouthful, it is actually a pretty accurate description. In practice, IPython Notebook allows you to intersperse your code with comments and images and anything else that might be useful. You can use IPython Notebooks for everything from presentations (a great replacement for PowerPoint) to an electronic laboratory notebook or a textbook.
If you have completed the installation, you should be ready to tackle the following recipes.
These steps will get you started with exploring the incredibly powerful IPython Notebook environment. We urge you to go beyond this simple set of steps to understand the true power of the tool.
For those of you coming from either more traditional statistical software packages, such as Stata, SPSS, or SAS, or more traditional mathematical software packages, such as MATLAB, Mathematica, or Maple, you are probably used to the very graphical and feature-rich interactive environments provided by the respective companies. From this background, IPython Notebook might seem a bit foreign but hopefully much more user friendly and less intimidating than the traditional Python prompt. Further, IPython Notebook offers an interesting combination of interactivity and sequential workflow that is particularly well suited for data analysis, especially during the prototyping phases. R has a library called knitr (http://yihui.name/knitr/) that offers report-generating capabilities similar to those of IPython Notebook.
When you type in ipython notebook, you are launching a server running on your local machine, and IPython Notebook itself is really a web application that uses a server-client architecture. The IPython Notebook server, as per ipython.org, uses a two-process kernel architecture with ZeroMQ (http://zeromq.org/) and Tornado. ZeroMQ is an intelligent socket library for high-performance messaging, helping IPython manage distributed compute clusters among other tasks. Tornado is a Python web framework and asynchronous networking module that serves IPython Notebook’s HTTP requests. The project is open source and you can contribute to the source code if you are so inclined.
IPython Notebook also allows you to export your notebooks, which are actually just text files filled with JSON, to a large number of alternative formats using the command-line tool called nbconvert (http://ipython.org/ipython-doc/rel-1.0.0/interactive/nbconvert.html). Available export formats include HTML, LaTeX, reveal.js HTML slideshows, Markdown, simple Python scripts, and reStructuredText for Sphinx documentation.
Finally, there is IPython Notebook Viewer (nbviewer), which is a free web service where you can both post and go through static, HTML versions of notebook files hosted on remote servers (these servers are currently donated by Rackspace). Thus, if you create an amazing .ipynb file that you want to share, you can upload it to http://nbviewer.ipython.org/ and let the world see your efforts.
We will try not to sing too loudly the praises of Markdown, but if you are unfamiliar with the tool, we strongly suggest that you try it out. Markdown is actually two different things: a syntax for formatting plain text in a way that can be easily converted to a structured document, and a software tool that converts said text into HTML and other languages. Basically, Markdown enables the author to use any desired simple text editor (VI, VIM, Emacs, Sublime Text, TextWrangler, Crimson Editor, or Notepad) to capture plain text that can still describe relatively complex structures, such as different levels of headers, ordered and unordered lists, and block quotes, as well as some formatting, such as bold and italics. Markdown essentially offers a very human-readable alternative to HTML, much as JSON offers a very human-readable data format.
In this recipe, we are going to start our Python-based analysis of the automobile fuel efficiencies data.
If you completed the first installation successfully, you should be ready to get started.
The following steps will see you through setting up your working directory and IPython for the analysis for this article:
ipython notebook
import pandas as pd
import numpy as np
from ggplot import *
%matplotlib inline
Then, hit Shift + Enter to execute the cell. This imports both the pandas and numpy libraries, assigning them local names to save a few characters while typing commands. It also imports the ggplot library. Please note that using the from ggplot import * command is not a best practice in Python, as it pours the ggplot package contents into our default namespace. However, we are doing this so that our ggplot syntax most closely resembles the R ggplot2 syntax, which is decidedly not Pythonic. Finally, we use a magic command to tell IPython Notebook that we want matplotlib graphs to render in the notebook.
vehicles = pd.read_csv("vehicles.csv")
vehicles.head()
Then, press Shift + Enter. The following text should be shown:
However, notice that a red warning message appears as follows:
/Library/Python/2.7/site-packages/pandas/io/parsers.py:1070:
DtypeWarning: Columns (22,23,70,71,72,73) have mixed types.
Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
This tells us that columns 22, 23, 70, 71, 72, and 73 contain mixed data types. Let’s find the corresponding names using the following commands:
column_names = vehicles.columns.values
column_names[[22, 23, 70, 71, 72, 73]]

array(['cylinders', 'displ', 'fuelType2', 'rangeA', 'evMotor', 'mfrCode'], dtype=object)
Mixed data types sound like they could be problematic, so make a mental note of these column names. Remember, data cleaning and wrangling often consume 90 percent of project time.
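The warning itself tells us the fix: specify the dtype option on import. As a minimal sketch, here is the same situation reproduced on a tiny synthetic CSV (the rangeA column name is taken from the warning above; the values are made up for illustration):

```python
import io
import pandas as pd

# A tiny CSV whose rangeA column mixes numbers and strings,
# mimicking the mixed-type columns flagged by the DtypeWarning.
csv_data = io.StringIO("id,rangeA\n1,290\n2,290/370\n3,\n")

# Forcing the problematic column to a single dtype on import
# avoids the warning and gives predictable behavior downstream.
df = pd.read_csv(csv_data, dtype={"rangeA": str})
print(df["rangeA"].dtype)  # object (i.e., strings)
```

Alternatively, passing low_memory=False makes pandas read the whole file at once before inferring types, at the cost of more memory.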
With this recipe, we are simply setting up our working directory and creating a new IPython Notebook that we will use for the analysis. We have imported the pandas library and very quickly read the vehicles.csv data file directly into a data frame. Speaking from experience, pandas’ robust data import capabilities will save you a lot of time.
Although we imported data directly from a comma-separated value file into a data frame, pandas is capable of handling many other formats, including Excel, HDF, SQL, JSON, Stata, and even the clipboard using the reader functions. We can also write out the data from data frames in just as many formats using writer functions accessed from the data frame object.
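The reader/writer symmetry can be sketched with a JSON round trip on a small stand-in frame (the real analysis, of course, uses the vehicles data):

```python
import io
import pandas as pd

# A small stand-in DataFrame with two of the vehicles columns.
df = pd.DataFrame({"make": ["Ford", "Toyota"], "city08": [18, 28]})

# Writer methods hang off the DataFrame object itself...
json_text = df.to_json()

# ...while the matching reader functions live at the top level of pandas.
round_trip = pd.read_json(io.StringIO(json_text))
print(round_trip)
```

The same pattern holds for the other formats: to_csv/read_csv, to_excel/read_excel, to_sql/read_sql, and so on.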
Using the bound method head that is part of the Data Frame class in pandas, we have received a very informative summary of the data frame, including a per-column count of non-null values and a count of the various data types across the columns.
The data frame is an incredibly powerful concept and data structure. Thinking in data frames is critical for many data analyses yet also very different from thinking in array or matrix operations (say, if you are coming from MATLAB or C as your primary development languages).
With the data frame, each column represents a different variable or characteristic and can be a different data type, such as floats, integers, or strings. Each row of the data frame is a separate observation or instance with its own set of values. For example, if each row represents a person, the columns could be age (an integer) and gender (a category or string). Often, we will want to select the set of observations (rows) that match a particular characteristic (say, all males) and examine this subgroup. The data frame is conceptually very similar to a table in a relational database.
Now that we have imported the automobile fuel efficiency dataset into IPython and witnessed the power of pandas, the next step is to replicate the preliminary analysis performed in R, getting your feet wet with some basic pandas functionality.
We will continue to grow and develop the IPython Notebook that we started in the previous recipe. If you’ve completed the previous recipe, you should have everything you need to continue.
len(vehicles)
34287
If you switch back and forth between R and Python, remember that in R, the function is length and in Python, it is len.
len(vehicles.columns)
74
print(vehicles.columns)
Index([u'barrels08', u'barrelsA08', u'charge120',
u'charge240', u'city08', u'city08U', u'cityA08', u'cityA08U',
u'cityCD', u'cityE', u'cityUF', u'co2', u'co2A',
u'co2TailpipeAGpm', u'co2TailpipeGpm', u'comb08', u'comb08U',
u'combA08', u'combA08U', u'combE', u'combinedCD',
u'combinedUF', u'cylinders', u'displ', u'drive', u'engId',
u'eng_dscr', u'feScore', u'fuelCost08', u'fuelCostA08',
u'fuelType', u'fuelType1', u'ghgScore', u'ghgScoreA',
u'highway08', u'highway08U', u'highwayA08', u'highwayA08U',
u'highwayCD', u'highwayE', u'highwayUF', u'hlv', u'hpv',
u'id', u'lv2', u'lv4', u'make', u'model', u'mpgData',
u'phevBlended', u'pv2', u'pv4', u'range', u'rangeCity',
u'rangeCityA', u'rangeHwy', u'rangeHwyA', u'trany', u'UCity',
u'UCityA', u'UHighway', u'UHighwayA', u'VClass', u'year',
u'youSaveSpend', u'guzzler', u'trans_dscr', u'tCharger',
u'sCharger', u'atvType', u'fuelType2', u'rangeA', u'evMotor',
u'mfrCode'], dtype=object)
The u letter in front of each string indicates that the strings are represented in Unicode (http://docs.python.org/2/howto/unicode.html).
len(pd.unique(vehicles.year))
31
min(vehicles.year)
1984
max(vehicles["year"])
2014
Note that again, we have used two different syntaxes to reference individual columns within the vehicles data frame.
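The two syntaxes are interchangeable for simple column names, as this minimal sketch on an invented two-column frame shows:

```python
import pandas as pd

df = pd.DataFrame({"year": [1984, 2014], "make": ["Ford", "Tesla"]})

# Attribute access and bracket access return the same Series...
print(df.year.equals(df["year"]))  # True

# ...but only bracket access works for names containing spaces or names
# that collide with existing DataFrame attributes or methods.
print(df["year"].min())  # 1984
```

Bracket access is therefore the safer habit, while attribute access is a convenient shorthand for well-behaved names.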
pd.value_counts(vehicles.fuelType1)
Regular Gasoline 24587
Premium Gasoline 8521
Diesel 1025
Natural Gas 57
Electricity 56
Midgrade Gasoline 41
dtype: int64
pd.value_counts(vehicles.trany)
However, this results in a bit of unexpected and lengthy output:
What we really want to know is the number of cars with automatic and manual transmissions. We notice that the trany variable always starts with the letter A when it represents an automatic transmission and M for manual transmission. Thus, we create a new variable, trany2, that contains the first character of the trany variable, which is a string:
vehicles["trany2"] = vehicles.trany.str[0]
pd.value_counts(vehicles.trany2)
The preceding command yields the answer we wanted: roughly twice as many automatics as manuals:
A 22451
M 11825
dtype: int64
In this recipe, we looked at some basic functionality in Python and pandas. We used two different syntaxes (vehicles['trany'] and vehicles.trany) to access variables within the data frame. We also used some core pandas functions, such as the incredibly useful unique and value_counts, to explore the data.
In terms of the data science pipeline, we have touched on two stages in a single recipe: data cleaning and data exploration. Often, when working with smaller datasets where the time to complete a particular action is quite short and can be completed on our laptop, we will very quickly go through multiple stages of the pipeline and then loop back, depending on the results. In general, the data science pipeline is a highly iterative process. The faster we can accomplish steps, the more iterations we can fit into a fixed time, and often, we can create a better final analysis.
This article took you through the process of analyzing and visualizing automobile data to identify trends and patterns in fuel efficiency over time using the powerful programming language, Python.