This article, written by Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, and Abhijit Dasgupta, authors of the book Practical Data Science Cookbook, covers the following topics:
- Getting started with IPython
- Exploring IPython Notebook
- Preparing to analyze automobile fuel efficiencies
- Exploring and describing the fuel efficiency data with Python
The dataset, available at http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip, contains fuel efficiency performance metrics over time for all makes and models of automobiles in the United States of America. This dataset also contains numerous other features and attributes of the automobile models other than fuel economy, providing an opportunity to summarize and group the data so that we can identify interesting trends and relationships.
We will perform the entire analysis using Python. However, we will ask the same questions and follow the same sequence of steps as before, again following the data science pipeline. With study, this will allow you to see the similarities and differences between the two languages for a mostly identical analysis.
In this article, we will take a very different approach using Python as a scripting language in an interactive fashion that is more similar to R. We will introduce the reader to the unofficial interactive environment of Python, IPython, and the IPython notebook, showing how to produce readable and well-documented analysis scripts. Further, we will leverage the data analysis capabilities of the relatively new but powerful pandas library and the invaluable data frame data type that it offers. pandas often allows us to complete complex tasks with fewer lines of code. The drawback to this approach is that while you don’t have to reinvent the wheel for common data manipulation tasks, you do have to learn the API of a completely different package, which is pandas.
The goal of this article is not to guide you through an analysis project that you have already completed but to show you how that project can be completed in another language. More importantly, we want to get you, the reader, to become more introspective with your own code and analysis. Think not only about how something is done but why something is done that way in that particular language. How does the language shape the analysis?
Getting started with IPython
IPython is the interactive computing shell for Python that will change the way you think about interactive shells. It brings to the table a host of very useful functionalities that will most likely become part of your default toolbox, including magic functions, tab completion, easy access to command-line tools, and much more. We will only scratch the surface here and strongly recommend that you keep exploring what can be done with IPython.
If you have completed the installation, you should be ready to tackle the following recipes. Note that IPython 2.0, which is a major release, was launched in 2014.
How to do it…
The following steps will get you up and running with the IPython environment:
- Open up a terminal window on your computer and type ipython. You should be immediately presented with the following text:
Python 2.7.5 (default, Mar 9 2014, 22:15:05)
Type "copyright", "credits" or "license" for more information.
IPython 2.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
Note that your version might be slightly different than what is shown in the preceding command-line output.
- Just to show you how great IPython is, type in ls, and you should be greeted with the directory listing! Yes, you have access to common Unix commands straight from your Python prompt inside the Python interpreter.
- Now, let’s try changing directories. Type cd at the prompt, hit space, and now hit Tab. You should be presented with a list of directories available from within the current directory. Start typing the first few letters of the target directory, and then hit Tab again. If only one option matches, hitting the Tab key will automatically insert that name. Otherwise, the list of possibilities will show only those names that match the letters that you have already typed. Each letter that is entered acts as a filter when you press Tab.
- Now, type ?, and you will get a quick introduction to and overview of IPython’s features.
- Let’s take a look at the magic functions. These are special functions that IPython understands and will always start with the % symbol. The %paste function is one such example and is amazing for copying and pasting Python code into IPython without losing proper indentation.
- We will try the %timeit magic function that intelligently benchmarks Python code. Enter the following commands:
n = 100000
%timeit range(n)
%timeit xrange(n)
We should get an output like this:
1000 loops, best of 3: 1.22 ms per loop
1000000 loops, best of 3: 258 ns per loop
This shows you how much faster xrange is than range (1.22 milliseconds versus 258 nanoseconds!) and helps show you the utility of generators in Python.
- You can also easily run system commands by prefacing the command with an exclamation mark. Try the following command:
You should see the following output:
PING google.com (220.127.116.11): 56 data bytes
64 bytes from 18.104.22.168: icmp_seq=0 ttl=38 time=40.733 ms
64 bytes from 22.214.171.124: icmp_seq=1 ttl=38 time=40.183 ms
64 bytes from 126.96.36.199: icmp_seq=2 ttl=38 time=37.635 ms
- Finally, IPython provides an excellent command history. Simply press the up arrow key to access the previously entered command. Continue to press the up arrow key to walk backwards through the command list of your session and the down arrow key to come forward. Also, the magic %history command allows you to jump to a particular command number in the session. Type the following command to see the first command that you entered:
- Now, type exit to drop out of IPython and back to your system command prompt.
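Outside IPython, the standard library timeit module can do a similar job to the %timeit magic. The following is a minimal sketch, assuming Python 3, where xrange no longer exists and range is already lazy; list(range(n)) stands in for the eager Python 2 range. The number of repetitions (100) is an arbitrary choice for illustration.

```python
import timeit

n = 100000

# Eager: build a full list of n integers on every call
# (this mirrors what range(n) did in Python 2).
eager_seconds = timeit.timeit(lambda: list(range(n)), number=100)

# Lazy: create a range object that produces values on demand
# (this mirrors what xrange(n) did in Python 2).
lazy_seconds = timeit.timeit(lambda: range(n), number=100)

print("list(range(n)): %.6f s for 100 calls" % eager_seconds)
print("range(n):       %.6f s for 100 calls" % lazy_seconds)
```

The absolute numbers depend on your machine, but the lazy version should be faster by several orders of magnitude.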
How it works…
There isn’t much to explain here and we have just scratched the surface of what IPython can do. Hopefully, we have gotten you interested in diving deeper, especially with the wealth of new features offered by IPython 2.0, including dynamic and user-controllable data visualizations.
- IPython at http://ipython.org/
- The IPython Cookbook at https://github.com/ipython/ipython/wiki?path=Cookbook
- IPython: A System for Interactive Scientific Computing at http://fperez.org/papers/ipython07_pe-gr_cise.pdf
- Learning IPython for Interactive Computing and Data Visualization, Cyrille Rossant, Packt Publishing, available at http://www.packtpub.com/learning-ipython-for-interactive-computing-and-data-visualization/book
- The future of IPython at http://www.infoworld.com/print/236429
Exploring IPython Notebook
IPython Notebook is the perfect complement to IPython. As per the IPython website:
“The IPython Notebook is a web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document.”
While this is a bit of a mouthful, it is actually a pretty accurate description. In practice, IPython Notebook allows you to intersperse your code with comments and images and anything else that might be useful. You can use IPython Notebooks for everything from presentations (a great replacement for PowerPoint) to an electronic laboratory notebook or a textbook.
If you have completed the installation, you should be ready to tackle the following recipes.
How to do it…
These steps will get you started with exploring the incredibly powerful IPython Notebook environment. We urge you to go beyond this simple set of steps to understand the true power of the tool.
- Type ipython notebook --pylab=inline in the command prompt. The --pylab=inline option should allow your plots to appear inline in your notebook. You should see some text quickly scroll by in the terminal window, and then a new page should load in the default browser (for me, this is Chrome). Note that the URL should be http://127.0.0.1:8888/, indicating that the browser is connected to a server running on the local machine at port 8888.
- You should not see any notebooks listed in the browser (note that IPython Notebook files have a .ipynb extension) as IPython Notebook searches the directory you launched it from for notebook files. Let’s create a notebook now. Click on the New Notebook button in the upper right-hand side of the page. A new browser tab or window should open up, showing a new, empty notebook.
- From the top down, you can see the text-based menu followed by the toolbar for issuing common commands, and then, your very first cell, which should resemble the command prompt in IPython.
- Place the mouse cursor in the first cell and type 5+5. Next, either navigate to Cell | Run or press Shift + Enter as a keyboard shortcut to cause the contents of the cell to be interpreted. You should now see the result, 10, displayed below the cell. Basically, we just executed a simple Python statement within the first cell of our first IPython Notebook.
- Click on the second cell, and then, navigate to Cell | Cell Type | Markdown. Now, you can easily write markdown in the cell for documentation purposes.
- Close the two browser windows or tabs (the notebook and the notebook browser).
- Go back to the terminal in which you typed ipython notebook, hit Ctrl + C, then hit Y, and press Enter. This will shut down the IPython Notebook server.
How it works…
If you are coming from more traditional statistical software packages, such as Stata, SPSS, or SAS, or from more traditional mathematical software packages, such as MATLAB, Mathematica, or Maple, you are probably used to the very graphical and feature-rich interactive environments provided by the respective companies. From this background, IPython Notebook might seem a bit foreign but hopefully much more user friendly and less intimidating than the traditional Python prompt. Further, IPython Notebook offers an interesting combination of interactivity and sequential workflow that is particularly well suited for data analysis, especially during the prototyping phases. R has a library called knitr (http://yihui.name/knitr/) that offers report-generating capabilities similar to those of IPython Notebook.
When you type in ipython notebook, you are launching a server running on your local machine, and IPython Notebook itself is really a web application that uses a server-client architecture. The IPython Notebook server, as per ipython.org, uses a two-process kernel architecture with ZeroMQ (http://zeromq.org/) and Tornado. ZeroMQ is an intelligent socket library for high-performance messaging, helping IPython manage distributed compute clusters among other tasks. Tornado is a Python web framework and asynchronous networking module that serves IPython Notebook’s HTTP requests. The project is open source and you can contribute to the source code if you are so inclined.
IPython Notebook also allows you to export your notebooks, which are actually just text files filled with JSON, to a large number of alternative formats using the command-line tool called nbconvert (http://ipython.org/ipython-doc/rel-1.0.0/interactive/nbconvert.html). Available export formats include HTML, LaTeX, reveal.js HTML slideshows, Markdown, simple Python scripts, and reStructuredText for the Sphinx documentation.
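To make the "text files filled with JSON" point concrete, here is a minimal sketch that builds a tiny notebook-like structure and reads it back with the standard library json module. The field names follow the later nbformat 4 layout, which is an assumption on our part; notebooks written by IPython 2.x use an older layout, and the cell contents are hypothetical.

```python
import json

# Hypothetical minimal notebook using nbformat 4 field names (an assumption;
# files written by IPython 2.x use the older nbformat 3 layout).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Fuel efficiency"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["5 + 5"]},
    ],
}

# Serialize to the kind of text that would be stored in a .ipynb file,
# then parse it back as any JSON-aware tool could.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)

cell_types = [cell["cell_type"] for cell in loaded["cells"]]
code_sources = ["".join(cell["source"])
                for cell in loaded["cells"] if cell["cell_type"] == "code"]

print(cell_types)
print(code_sources)
```

Because the file is plain JSON, it plays well with version control and can be inspected with any text editor.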
Finally, there is IPython Notebook Viewer (nbviewer), which is a free web service where you can both post and go through static, HTML versions of notebook files hosted on remote servers (these servers are currently donated by Rackspace). Thus, if you create an amazing .ipynb file that you want to share, you can upload it to http://nbviewer.ipython.org/ and let the world see your efforts.
We will try not to sing too loudly the praises of Markdown, but if you are unfamiliar with the tool, we strongly suggest that you try it out. Markdown is actually two different things: a syntax for formatting plain text in a way that can be easily converted to a structured document and a software tool that converts said text into HTML and other languages. Basically, Markdown enables the author to use any simple text editor (vi, Vim, Emacs, Sublime Text, TextWrangler, Crimson Editor, or Notepad) that can capture plain text yet still describe relatively complex structures such as different levels of headers, ordered and unordered lists, and block quotes as well as some formatting such as bold and italics. Markdown offers a very human-readable version of HTML in much the same way that JSON offers a very human-readable data format.
- IPython Notebook at http://ipython.org/notebook.html
- The IPython Notebook documentation at http://ipython.org/ipython-doc/stable/interactive/notebook.html
- An interesting IPython Notebook collection at https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks
- The IPython Notebook development retrospective at http://blog.fperez.org/2012/01/ipython-notebook-historical.html
- Setting up a remote IPython Notebook server at http://nbviewer.ipython.org/github/Unidata/tds-python-workshop/blob/master/ipython-notebook-server.ipynb
- The Markdown home page at https://daringfireball.net/projects/markdown/basics
Preparing to analyze automobile fuel efficiencies
In this recipe, we are going to start our Python-based analysis of the automobile fuel efficiencies data.
If you completed the first installation successfully, you should be ready to get started.
How to do it…
The following steps will see you through setting up your working directory and IPython for the analysis for this article:
- Create a project directory called fuel_efficiency_python.
- Download the automobile fuel efficiency dataset from http://fueleconomy.gov/feg/epadata/vehicles.csv.zip and store it in the preceding directory. Extract the vehicles.csv file from the zip file into the same directory.
- Open a terminal window and change the current directory (cd) to the fuel_efficiency_python directory.
- At the terminal, type the following command:
ipython notebook
- Once the new page has loaded in your web browser, click on New Notebook.
- Click on the current name of the notebook, which is untitled0, and enter in a new name for this analysis (mine is fuel_efficiency_python).
- Let’s use the top-most cell for import statements. Type in the following commands:
import pandas as pd
import numpy as np
from ggplot import *
%matplotlib inline
Then, hit Shift + Enter to execute the cell. This imports both the pandas and numpy libraries, assigning them local names to save a few characters while typing commands. It also imports the ggplot library. Please note that using the from ggplot import * statement is not a best practice in Python, as it pours the ggplot package contents into our default namespace. However, we are doing this so that our ggplot syntax most closely resembles the R ggplot2 syntax, which is decidedly not Pythonic. Finally, we use a magic command to tell IPython Notebook that we want matplotlib graphs to render in the notebook.
- In the next cell, let’s import the data and look at the first few records:
vehicles = pd.read_csv("vehicles.csv")
Then, press Shift + Enter. The following text should be shown:
However, notice that a red warning message appears as follows:
DtypeWarning: Columns (22,23,70,71,72,73) have mixed types.
Specify dtype option on import or set low_memory=False. data
This tells us that columns 22, 23, 70, 71, 72, and 73 contain mixed data types. Let’s find the corresponding names using the following commands:
column_names = vehicles.columns.values
column_names[[22, 23, 70, 71, 72, 73]]
array([cylinders, displ, fuelType2, rangeA, evMotor, mfrCode], dtype=object)
Mixed data types sound like they could be problematic, so make a mental note of these column names. Remember, data cleaning and wrangling often consume 90 percent of project time.
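One way to deal with mixed-type columns like these is to read them as strings and then convert explicitly. The following is only a sketch on a tiny made-up CSV (the rows are hypothetical, not taken from vehicles.csv); it uses the dtype option that the warning mentions together with pd.to_numeric.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for vehicles.csv; the rows are made up.
csv_text = """make,cylinders,displ,evMotor
Audi,4,1.8,
Tesla,,,85 kW AC Induction
Ford,6,3.0,
"""

# Read the ambiguous columns as strings so pandas does not have to guess.
raw = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"cylinders": str, "displ": str, "evMotor": str},
)

# Convert the columns that should be numeric; anything that cannot be
# parsed (including missing values) becomes NaN instead of raising an error.
raw["cylinders"] = pd.to_numeric(raw["cylinders"], errors="coerce")
raw["displ"] = pd.to_numeric(raw["displ"], errors="coerce")

print(raw.dtypes)
print(raw)
```

Declaring the types up front trades a little typing for a data frame whose columns behave predictably in later steps.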
How it works…
With this recipe, we are simply setting up our working directory and creating a new IPython Notebook that we will use for the analysis. We have imported the pandas library and very quickly read the vehicles.csv data file directly into a data frame. Speaking from experience, pandas’ robust data import capabilities will save you a lot of time.
Although we imported data directly from a comma-separated value file into a data frame, pandas is capable of handling many other formats, including Excel, HDF, SQL, JSON, Stata, and even the clipboard using the reader functions. We can also write out the data from data frames in just as many formats using writer functions accessed from the data frame object.
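As a small illustration of the reader and writer functions, the sketch below writes a data frame out as CSV and JSON text and reads the CSV back in. The three rows are hypothetical values chosen for the example, and in-memory strings are used in place of files.

```python
import io
import json

import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration.
df = pd.DataFrame({
    "make": ["Audi", "Ford", "Honda"],
    "year": [1984, 1999, 2014],
    "comb08": [21, 18, 33],
})

# Writer functions hang off the data frame object...
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

# ...while reader functions live in the pandas namespace.
from_csv = pd.read_csv(io.StringIO(csv_text))
records = json.loads(json_text)

print(from_csv)
print(records[0])
```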
Using the bound method head that is part of the DataFrame class in pandas, we have received a very informative summary of the data frame, including a per-column count of non-null values and a count of the various data types across the columns.
The data frame is an incredibly powerful concept and data structure. Thinking in data frames is critical for many data analyses yet also very different from thinking in array or matrix operations (say, if you are coming from MATLAB or C as your primary development language).
With the data frame, each column represents a different variable or characteristic and can be a different data type, such as floats, integers, or strings. Each row of the data frame is a separate observation or instance with its own set of values. For example, if each row represents a person, the columns could be age (an integer) and gender (a category or string). Often, we will want to select the set of observations (rows) that match a particular characteristic (say, all males) and examine this subgroup. The data frame is conceptually very similar to a table in a relational database.
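The person example above can be sketched in a few lines of pandas. The names people and males and the values in the frame are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Each row is one observation (a person); each column is one variable.
people = pd.DataFrame({
    "age": [34, 27, 45, 52],
    "gender": ["male", "female", "male", "female"],
})

# A boolean mask selects the rows that match a characteristic.
is_male = people["gender"] == "male"
males = people[is_male]

print(len(males))
print(males["age"].mean())
```

The mask plays much the same role as a WHERE clause when querying a table in a relational database.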
- Data structures in pandas at http://pandas.pydata.org/pandas-docs/stable/dsintro.html
- Data frames in R at http://www.r-tutor.com/r-introduction/data-frame
Exploring and describing the fuel efficiency data with Python
Now that we have imported the automobile fuel efficiency dataset into IPython and witnessed the power of pandas, the next step is to replicate the preliminary analysis performed in R, getting your feet wet with some basic pandas functionality.
We will continue to grow and develop the IPython Notebook that we started in the previous recipe. If you’ve completed the previous recipe, you should have everything you need to continue.
How to do it…
- First, let’s find out how many observations (rows) are in our data using the following command:
If you switch back and forth between R and Python, remember that in R, the function is length and in Python, it is len.
- Next, let’s find out how many variables (columns) are in our data using the following command:
- Let’s get a list of the names of the columns using the following command:
Index([u'barrels08', u'barrelsA08', u'charge120',
u'charge240', u'city08', u'city08U', u'cityA08', u'cityA08U',
u'cityCD', u'cityE', u'cityUF', u'co2', u'co2A',
u'co2TailpipeAGpm', u'co2TailpipeGpm', u'comb08', u'comb08U',
u'combA08', u'combA08U', u'combE', u'combinedCD',
u'combinedUF', u'cylinders', u'displ', u'drive', u'engId',
u'eng_dscr', u'feScore', u'fuelCost08', u'fuelCostA08',
u'fuelType', u'fuelType1', u'ghgScore', u'ghgScoreA',
u'highway08', u'highway08U', u'highwayA08', u'highwayA08U',
u'highwayCD', u'highwayE', u'highwayUF', u'hlv', u'hpv',
u'id', u'lv2', u'lv4', u'make', u'model', u'mpgData',
u'phevBlended', u'pv2', u'pv4', u'range', u'rangeCity',
u'rangeCityA', u'rangeHwy', u'rangeHwyA', u'trany', u'UCity',
u'UCityA', u'UHighway', u'UHighwayA', u'VClass', u'year',
u'youSaveSpend', u'guzzler', u'trans_dscr', u'tCharger',
u'sCharger', u'atvType', u'fuelType2', u'rangeA', u'evMotor',
The letter u in front of each string indicates that the strings are represented in Unicode (http://docs.python.org/2/howto/unicode.html).
- Let’s find out how many unique years of data are included in this dataset and what the first and last years are using the following command:
len(pd.unique(vehicles.year))
31
min(vehicles.year)
1984
max(vehicles["year"])
2014
Note that again, we have used two different syntaxes to reference individual columns within the vehicles data frame.
- Next, let’s find out what types of fuel are used as the automobiles’ primary fuel types. In R, we have the table function that will return a count of the occurrences of a variable’s various values. In pandas, we use the following:
Regular Gasoline 24587
Premium Gasoline 8521
Natural Gas 57
Midgrade Gasoline 41
- Now if we want to explore what types of transmissions these automobiles have, we immediately try the following command:
However, this results in a bit of unexpected and lengthy output:
What we really want to know is the number of cars with automatic and manual transmissions. We notice that the trany variable always starts with the letter A when it represents an automatic transmission and M for manual transmission. Thus, we create a new variable, trany2, that contains the first character of the trany variable, which is a string:
vehicles["trany2"] = vehicles.trany.str[0]
vehicles.trany2.value_counts()
The preceding commands yield the answer that we wanted: about twice as many automatics as manuals.
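The first few steps of this recipe can be tried without the real dataset. The sketch below uses a hypothetical four-row frame named vehicles_demo (the values are made up) to count rows and columns and to summarize the year column.

```python
import pandas as pd

# Hypothetical stand-in for the vehicles data frame; the values are made up.
vehicles_demo = pd.DataFrame({
    "make": ["Audi", "Ford", "Honda", "Ford"],
    "year": [1984, 1984, 2014, 2000],
    "comb08": [21, 18, 33, 20],
})

# Number of observations (rows) and variables (columns).
n_rows = len(vehicles_demo)
n_cols = len(vehicles_demo.columns)

# The shape attribute returns both counts at once as a (rows, columns) tuple.
shape = vehicles_demo.shape

# Distinct years, plus the first and last year in the data.
years = vehicles_demo["year"]
n_years = len(pd.unique(years))
first_year, last_year = years.min(), years.max()

print(n_rows, n_cols, shape)
print(n_years, first_year, last_year)
```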
How it works…
In this recipe, we looked at some basic functionality in Python and pandas. We have used two different syntaxes (vehicles['trany'] and vehicles.trany) to access variables within the data frame. We have also used some of the core pandas functions to explore the data, such as the incredibly useful unique and value_counts functions.
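As a compact recap, the following sketch shows the two access syntaxes, value_counts, and the string accessor on a hypothetical four-row frame named demo. The values are made up, so the counts do not reflect the real dataset.

```python
import pandas as pd

# Hypothetical sample; the real dataset has many more distinct values.
demo = pd.DataFrame({
    "fuelType1": ["Regular Gasoline", "Premium Gasoline",
                  "Regular Gasoline", "Regular Gasoline"],
    "trany": ["Automatic 4-spd", "Manual 5-spd",
              "Automatic 3-spd", "Manual 6-spd"],
})

# Bracket and attribute syntax return the same column.
same_column = demo["trany"].equals(demo.trany)

# Count the occurrences of each distinct value, most frequent first.
fuel_counts = demo["fuelType1"].value_counts()

# The .str accessor applies string operations element-wise; [0] takes the
# first character. New columns are assigned with the bracket syntax.
demo["trany2"] = demo["trany"].str[0]
trany_counts = demo["trany2"].value_counts()

print(same_column)
print(fuel_counts)
print(trany_counts)
```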
In terms of the data science pipeline, we have touched on two stages in a single recipe: data cleaning and data exploration. Often, when working with smaller datasets where the time to complete a particular action is quite short and can be completed on our laptop, we will very quickly go through multiple stages of the pipeline and then loop back, depending on the results. In general, the data science pipeline is a highly iterative process. The faster we can accomplish steps, the more iterations we can fit into a fixed time, and often, we can create a better final analysis.
- The pandas API overview at http://pandas.pydata.org/pandas-docs/stable/api.html
This article took you through the process of analyzing and visualizing automobile data to identify trends and patterns in fuel efficiency over time using the powerful programming language, Python.