
After the success of the book Python Data Analysis, Packt’s acquisition editor Prachi Bisht gauged the interest of its author, Ivan Idris, in publishing Python Data Analysis Cookbook. According to Ivan, Python Data Analysis is one of his best books. Python Data Analysis Cookbook is aimed at slightly more experienced Pythonistas and is written in the cookbook format. In the year after the release of Python Data Analysis, Ivan received a lot of feedback, mostly positive as far as he is concerned.

Although Python Data Analysis covers a wide range of topics, Ivan still had to leave out a lot of subjects. He realized that he needed a library to serve as a toolbox. He named it dautil, for data analysis utilities, and published the API on PyPI so that it can be installed with pip/easy_install. As you know, Python 2 will no longer be supported after 2020, so dautil is based on Python 3.

For the sake of reproducibility, Ivan also published a Docker repository named pydacbk (for Python Data Analysis Cookbook). The repository contains a Docker image with preinstalled software. For practical reasons, the image doesn’t contain all of the software used in the book, but it still covers a fair share. This article has the following sections:

  • Data analysis, data science, big data – what is the big deal?
  • A brief history of data analysis with Python
  • A high-level overview of dautil
  • IPython notebook utilities
  • Downloading data
  • Plotting utilities
  • Demystifying Docker
  • Future directions


Data analysis, data science, big data – what is the big deal?

You’ve probably seen Venn diagrams depicting data science as the intersection of mathematics/statistics, computer science, and domain expertise. Data analysis is timeless and predates both data science and computer science. You could perform data analysis with pen and paper and, in more modern times, with a pocket calculator.

Data analysis has many aspects, with goals such as making decisions or coming up with new hypotheses and questions. The hype, status, and financial rewards surrounding data science and big data remind Ivan of the time when data warehousing and business intelligence were the buzzwords. The ultimate goal of business intelligence and data warehousing was to build dashboards for management. This involved a lot of politics and organizational aspects, but on the technical side, it was mostly about databases. Data science, on the other hand, is not database-centric and leans heavily on machine learning. Machine learning techniques have become necessary because of the growing volumes of data. Data growth is caused by the growth of the world’s population and the rise of new technologies such as social media and mobile devices. Data growth is, in fact, probably the only trend that we can be sure will continue. The difference between constructing dashboards and applying machine learning is analogous to the way search engines evolved.

Search engines (if you can call them that) were initially nothing more than well-organized collections of links created manually. Eventually, the automated approach won. Since more data will be created over time (and not destroyed), we can expect an increase in automated data analysis.

A brief history of data analysis with Python

The history of the various Python software libraries is quite interesting. Ivan is not a historian, so the following notes are written from his own perspective:

  • 1989: Guido van Rossum implements the very first version of Python at the CWI in the Netherlands as a Christmas hobby project.
  • 1995: Jim Hugunin creates Numeric, the predecessor to NumPy.
  • 1999: Pearu Peterson writes f2py as a bridge between Fortran and Python.
  • 2000: Python 2.0 is released.
  • 2001: The SciPy library is released. Also, Numarray, a library competing with Numeric, is created. Fernando Perez releases IPython, which starts out as an afternoon hack. NLTK is released as a research project.
  • 2002: John Hunter creates the matplotlib library.
  • 2005: NumPy is released by Travis Oliphant. Initially, NumPy is Numeric extended with features inspired by Numarray.
  • 2006: NumPy 1.0 is released. The first version of SQLAlchemy is released.
  • 2007: The scikit-learn project is initiated as a Google Summer of Code project by David Cournapeau. Cython is forked from Pyrex. Cython is later intensively used in pandas and scikit-learn to improve performance.
  • 2008: Wes McKinney starts working on pandas. Python 3.0 is released.
  • 2011: The IPython 0.12 release introduces the IPython notebook. Packt releases NumPy 1.5 Beginner’s Guide.
  • 2012: Packt releases NumPy Cookbook.
  • 2013: Packt releases NumPy Beginner’s Guide – Second Edition.
  • 2014: Fernando Perez announces Project Jupyter, which aims to make a language-agnostic notebook. Packt releases Learning NumPy Array and Python Data Analysis.
  • 2015: Packt releases NumPy Beginner’s Guide – Third Edition and NumPy Cookbook – Second Edition.

A high-level overview of dautil

The dautil API that Ivan made for this book is a humble toolbox that he has found useful. It is released under the MIT license. This license is very permissive, so you could in theory use the library in a production system. He doesn’t recommend doing this currently (as of January 2016), but he believes that the unit tests and documentation are of acceptable quality. The library has 3,000+ lines of code and 180+ unit tests with reasonable coverage. He has fixed as many issues reported by pep8 and flake8 as possible.

Some of the functions in dautil are on the short side and are of very low complexity. This is on purpose. If there is a second edition (knock on wood), dautil will probably be completely transformed. The API evolved as Ivan wrote the book under high time pressure, so some of the decisions he made may not be optimal in retrospect. However, he hopes that people find dautil useful and, ideally, contribute to it.

The dautil modules are summarized in the following table:

Module            Description                                                                LOC
dautil.collect    Contains utilities related to collections                                  331
dautil.conf       Contains configuration utilities                                            48
dautil.data       Contains utilities to download and load data                               468
dautil.db         Contains database-related utilities                                         98
dautil.log_api    Contains logging utilities                                                 204
dautil.nb         Contains IPython/Jupyter notebook widgets and utilities                    609
dautil.options    Configures dynamic options of several libraries related to data analysis   71
dautil.perf       Contains performance-related utilities                                     162
dautil.plotting   Contains plotting utilities                                                382
dautil.report     Contains reporting utilities                                               232
dautil.stats      Contains statistical functions and utilities                               366
dautil.ts         Contains utilities for time series and dates                               217
dautil.web        Contains utilities for web mining and HTML processing                       47
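
All of these modules are reachable from the top-level package; the examples later in this article simply do import dautil as dl and access the modules as attributes. The following minimal sketch, based on those later examples, shows the convention:

import dautil as dl

# dautil.options: make matplotlib output mimic seaborn defaults
dl.options.mimic_seaborn()

# dautil.data: cached data loaders, for example the SPAN Facebook dataset
fb_file = dl.data.SPANFB().load()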

IPython notebook utilities

The IPython notebook has become a standard tool for data analysis. The dautil.nb module has several interactive IPython widgets that help with LaTeX rendering, setting matplotlib properties, and plotting. Ivan has defined a Context class, which represents the configuration settings of the widgets. The settings are stored in a pretty-printed JSON file named dautil.json in the current working directory. This could be extended, maybe even with a database backend. The following is an edited excerpt (so that it doesn’t take up a lot of space) of an example dautil.json:

{
    ...
    "calculating_moments": {
        "figure.figsize": [10.4, 7.7],
        "font.size": 11.2
    },
    "calculating_moments.latex": [1, 2, 3, 4, 5, 6, 7],
    "launching_futures": {
        "figure.figsize": [11.5, 8.5]
    },
    "launching_futures.labels": [
        [
            {},
            {
                "legend": "loc=best",
                "title": "Distribution of Means"
            }
        ],
        [
            {
                "legend": "loc=best",
                "title": "Distribution of Standard Deviation"
            },
            {
                "legend": "loc=best",
                "title": "Distribution of Skewness"
            }
        ]
    ],
    ...
}

The Context object can be constructed with a string; Ivan recommends using the name of the notebook, but any unique identifier will do. The dautil.nb.LatexRenderer class also uses the Context class. It is a utility class that helps you number and render LaTeX equations in an IPython/Jupyter notebook, for instance, as follows:

import dautil as dl


# Construct the Context with a unique name; here it matches the
# "calculating_moments" entry in the dautil.json excerpt above.
context = dl.nb.Context('calculating_moments')
lr = dl.nb.LatexRenderer(chapter=12, context=context)
lr.render(r'\delta\! = x - m')
lr.render(r"m' = m + \frac{\delta}{n}")
lr.render(r"M_2' = M_2 + \delta^2 \frac{n - 1}{n}")
lr.render(r"M_3' = M_3 + \delta^3 \frac{(n - 1)(n - 2)}{n^2} - \frac{3\delta M_2}{n}")
lr.render(r"M_4' = M_4 + \frac{\delta^4 (n - 1)(n^2 - 3n + 3)}{n^3} + \frac{6\delta^2 M_2}{n^2} - \frac{4\delta M_3}{n}")
lr.render(r'g_1 = \frac{\sqrt{n} M_3}{M_2^{3/2}}')
lr.render(r'g_2 = \frac{n M_4}{M_2^2} - 3.')

The following is the result:

[Screenshot: the numbered equations rendered in the notebook]

Another widget you may find useful is RcWidget, which sets matplotlib settings, as shown in the following screenshot:

[Screenshot: the RcWidget interface for matplotlib settings]
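
The following is a minimal sketch of how the widget might be created in a notebook cell. It assumes that RcWidget accepts the shared Context object, like the other dautil.nb utilities; check the dautil documentation for the exact signature:

import dautil as dl

# Assumption: RcWidget takes the notebook's Context, so that the chosen
# matplotlib settings are persisted to dautil.json.
context = dl.nb.Context('calculating_moments')
dl.nb.RcWidget(context)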

Downloading data

Sometimes, we require sample data to test an algorithm or prototype a visualization. In the dautil.data module, you will find many utilities for data retrieval. Throughout this book, Ivan has used weather data from the KNMI for the weather station in De Bilt. A couple of the utilities in the module add a caching layer on top of existing pandas functions, such as the ones that download data from the World Bank and Yahoo! Finance (the caching depends on the joblib library and is currently not very configurable). You can also get audio, demographics, Facebook, and marketing data.

The data is stored under a special data directory, which depends on the operating system. On the machine used in the book, it is stored under ~/Library/Application Support/dautil. The following example code loads data from the SPAN Facebook dataset and computes the clique number:

import networkx as nx
import dautil as dl


fb_file = dl.data.SPANFB().load()
G = nx.read_edgelist(fb_file,
                     create_using=nx.Graph(),
                     nodetype=int)

print('Graph Clique Number',
      nx.graph_clique_number(G.subgraph(list(range(2048)))))

 To understand what is going on in detail, you will need to read the book. In a nutshell, we load the data and use the NetworkX API to calculate a network metric.

Plotting utilities

Ivan visualizes data very often in the book. Plotting helps us get an idea of how the data is structured and helps us form hypotheses or research questions. Often, we want to chart multiple variables, but we also want to easily see which line is which. The standard solution in matplotlib is to cycle colors. However, Ivan prefers to cycle line widths and line styles as well. The following unit test demonstrates his solution to this issue:

def test_cycle_plotter_plot(self):
    m_ax = Mock()
    cp = plotting.CyclePlotter(m_ax)
    cp.plot([0], [0])
    m_ax.plot.assert_called_with([0], [0], '-', lw=1)
    cp.plot([0], [1])
    m_ax.plot.assert_called_with([0], [1], '--', lw=2)
    cp.plot([1], [0])
    m_ax.plot.assert_called_with([1], [0], '-.', lw=1)
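
Outside of the test suite, the same class can be pointed at a real matplotlib axes object. The following short sketch (the data is made up for illustration) lets CyclePlotter cycle through line styles and widths automatically:

import dautil as dl
import matplotlib.pyplot as plt
import numpy as np

# Each cp.plot() call picks the next line style/width combination,
# so the curves remain distinguishable even in black and white.
x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
cp = dl.plotting.CyclePlotter(ax)
cp.plot(x, np.sin(x))
cp.plot(x, np.cos(x))
cp.plot(x, np.sin(x) * np.cos(x))
ax.legend(['sin', 'cos', 'sin*cos'], loc='best')
plt.show()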

The dautil.plotting module currently also has helper tools for subplots, histograms, regression plots, and dealing with color maps. The following example code (the code for the labels has been omitted) demonstrates a bar chart utility function and a utility function from dautil.data that downloads stock price data:

import dautil as dl
import numpy as np
import matplotlib.pyplot as plt


ratios = []
STOCKS = ['AAPL', 'INTC', 'MSFT', 'KO', 'DIS', 'MCD', 'NKE', 'IBM']

for symbol in STOCKS:
    ohlc = dl.data.OHLC()
    P = ohlc.get(symbol)['Adj Close'].values
    N = len(P)
    mu = (np.log(P[-1]) - np.log(P[0]))/N
    var_a = 0
    var_b = 0

    # Variance of the one-period log returns around the drift mu
    for k in range(1, N):
        var_a += (np.log(P[k]) - np.log(P[k - 1]) - mu) ** 2

    var_a = var_a / N

    # Variance of the two-period log returns around the drift 2 * mu
    for k in range(1, N//2):
        var_b += (np.log(P[2 * k]) - np.log(P[2 * k - 2]) - 2 * mu) ** 2

    var_b = var_b / N

    ratios.append(var_b/var_a - 1)

_, ax = plt.subplots()
dl.plotting.bar(ax, STOCKS, ratios)
plt.show()

Refer to the following screenshot for the end result:

[Screenshot: bar chart of the variance ratios for the listed stocks]

The code performs a random walk test and calculates the corresponding ratio for a list of stock prices. The data is retrieved anew whenever you run the code, so you may get different results. Some readers may have an aversion to finance, but rest assured that this book has very little finance-related content.
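
In symbols, using the variable names from the code, the loops compute

$$
\mathrm{var}_a = \frac{1}{N}\sum_{k=1}^{N-1}\bigl(\ln P_k - \ln P_{k-1} - \mu\bigr)^2,\qquad
\mathrm{var}_b = \frac{1}{N}\sum_{k=1}^{\lfloor N/2\rfloor - 1}\bigl(\ln P_{2k} - \ln P_{2k-2} - 2\mu\bigr)^2
$$

and the bar chart shows $\mathrm{var}_b/\mathrm{var}_a - 1$ for each stock. Under a random walk, the variance of two-period log returns is about twice that of one-period returns; since both sums are divided by the same N, the plotted ratio should then be close to zero.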

The following script demonstrates a linear regression utility and caching downloader for World Bank data (the code for the watermark and plot labels has been omitted):

import dautil as dl
import matplotlib.pyplot as plt
import numpy as np


wb = dl.data.Worldbank()
countries = wb.get_countries()[['name', 'iso2c']]
inf_mort = wb.get_name('inf_mort')
gdp_pcap = wb.get_name('gdp_pcap')
df = wb.download(country=countries['iso2c'],
                 indicator=[inf_mort, gdp_pcap],
                 start=2010, end=2010).dropna()
loglog = df.applymap(np.log10)
x = loglog[gdp_pcap]
y = loglog[inf_mort]

dl.options.mimic_seaborn()
fig, [ax, ax2] = plt.subplots(2, 1)
ax.set_ylim([0, 200])
ax.scatter(df[gdp_pcap], df[inf_mort])
ax2.scatter(x, y)
dl.plotting.plot_polyfit(ax2, x, y)
plt.show()

The code should display the following image:

[Screenshot: scatter plots of infant mortality against GDP per capita, raw and log-transformed with a linear fit]

The program downloads World Bank data for 2010 and plots the infant mortality rate against the GDP per capita. Also shown is a linear fit of the log-transformed data.

Demystifying Docker

Docker uses Linux kernel features to provide an extra virtualization layer. It was created in 2013 by Solomon Hykes. Boot2Docker allows us to install Docker on Windows and Mac OS X as well. Boot2Docker uses a VirtualBox VM that contains a Linux environment with Docker. Ivan’s Docker image, which is mentioned in the introduction, is based on the continuumio/miniconda3 Docker image. The Docker installation docs are at https://docs.docker.com/index.html.

Once you install Boot2Docker, you need to initialize it. This is only necessary once, and Linux users don’t need this step:

$ boot2docker init

The next step for Mac OS X and Windows users is to start the VM:

$ boot2docker start

Check the Docker environment by starting a sample container:

$ docker run hello-world

Docker images are organized in repositories, in a way that resembles GitHub. A producer pushes images and a consumer pulls images. You can pull Ivan’s repository with the following command (its size is currently 387 MB):

$ docker pull ivanidris/pydacbk
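
Once the pull finishes, you can start a container from the image. As a minimal example (the exact entry point for running the notebooks may differ, so treat this as a sketch), the following opens an interactive shell inside the container:

$ docker run -i -t ivanidris/pydacbk /bin/bash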

Future directions

The dautil API consists of items Ivan thinks will be useful outside of the context of this book. Certain functions and classes that he felt were only suitable for a particular chapter are placed in separate per-chapter modules, such as ch12util.py. In retrospect, parts of those modules may need to be included in dautil as well.

In no particular order, Ivan has the following ideas for future dautil development:

  • He is playing with the idea of creating a parallel library with “Cythonized” code, but this depends on how dautil is received
  • Adding more data loaders as required
  • There is a whole range of streaming (or online) algorithms that he thinks should be included in dautil as well
  • The GUI of the notebook widgets should be improved and extended
  • The API should have more configuration options and be easier to configure

Summary

In this article, Ivan roughly sketched what data analysis, data science, and big data are about. This was followed by a brief history of data analysis with Python. Then, he started explaining dautil, the API he made to help him with this book. He gave a high-level overview and some examples of the IPython notebook utilities, features to download data, and plotting utilities.

He used Docker for testing and giving readers a reproducible data analysis environment, so he spent some time on that topic too. Finally, he mentioned the possible future directions that could be taken for the library in order to guide anyone who wants to contribute.
