10 min read

The examples are:

  • Plotting data from a database
  • Plotting data from a web page
  • Plotting the data extracted by parsing an Apache log file
  • Plotting the data read from a comma-separated values (CSV) file
  • Plotting extrapolated data using curve fitting
  • Third-party tools using Matplotlib (NetworkX and mpmath)

Let’s begin

Plotting data from a database

Databases often tend to collect much more information than we can simply extract and watch in a tabular format (let’s call it the “Excel sheet” report style).

Databases not only use efficient techniques to store and retrieve data, but they are also very good at aggregating it.

One suggestion we can give is to let the database do the work. For example, if we need to sum up a column, let’s make the database sum the data, and not sum it up in the code. In this way, the whole process is much more efficient because:

  • There is a smaller memory footprint for the Python code, since only the aggregate value is returned, not the whole result set to generate it
  • The database has to read all the rows in any case. However, if it’s smart enough, then it can sum values up as they are read
  • The database can efficiently perform such an operation on more than one column at a time

The data source we’re going to query is from an open source project: the Debian distribution. Debian has an interesting project called UDD , Ultimate Debian Database, which is a relational database where a lot of information (either historical or actual) about the distribution is collected and can be analyzed.

On the project website http://udd.debian.org/, we can fi nd a full dump of the database (quite big, honestly) that can be downloaded and imported into a local PostgreSQL instance (refer to http://wiki.debian.org/UltimateDebianDatabase/CreateLocalReplica for import instructions

Now that we have a local replica of UDD, we can start querying it:

# module to access PostgreSQL databases
import psycopg2
# matplotlib pyplot module
import matplotlib.pyplot as plt

Since UDD is stored in a PostgreSQL database, we need psycopg2 to access it. psycopg2 is a third-party module available at http://initd.org/projects/psycopg

# connect to UDD database
conn = psycopg2.connect(database="udd")
# prepare a cursor
cur = conn.cursor()

We will now connect to the database server to access the udd database instance, and then open a cursor on the connection just created.

# this is the query we'll be making
query = """
select to_char(date AT TIME ZONE 'UTC', 'HH24'), count(*)
from upload_history
where to_char(date, 'YYYY') = '2008'
group by 1
order by 1"""


We have prepared the select statement to be executed on UDD. What we wish to do here is extract the number of packages uploaded to the Debian archive (per hour) in the whole year of 2008.

  • date AT TIME ZONE ‘UTC’: As date field is of the type timestamp with time zone, it also contains time zone information, while we want something independent from the local time. This is the way to get a date in UTC time zone.
  • group by 1: This is what we have encouraged earlier, that is, let the database do the work. We let the query return the already aggregated data, instead of coding it into the program.
# execute the query
cur.execute(query)
# retrieve the whole result set
data = cur.fetchall()

We execute the query and fetch the whole result set from it.

# close cursor and connection
cur.close()
conn.close()

Remember to always close the resources that we’ve acquired in order to avoid memory or resource leakage and reduce the load on the server (removing connections that aren’t needed anymore).

# unpack data in hours (first column) and
# uploads (second column)
hours, uploads = zip(*data)

The query result is a list of tuples, (in this case, hour and number of uploads), but we need two separate lists—one for the hours and another with the corresponding number of uploads. zip() solves this with *data, we unpack the list, returning the sublists as separate arguments to zip(), which in return, aggregates the elements in the same position in the parameters into separated lists. Consider the following example:

In [1]: zip(['a1', 'a2'], ['b1', 'b2'])
Out[1]: [('a1', 'b1'), ('a2', 'b2')]

To complete the code:

# graph code
plt.plot(hours, uploads)
# the the x limits to the 'hours' limit
plt.xlim(0, 23)
# set the X ticks every 2 hours
plt.xticks(range(0, 23, 2))
# draw a grid
plt.grid()
# set title, X/Y labels
plt.title("Debian packages uploads per hour in 2008")
plt.xlabel("Hour (in UTC)")
plt.ylabel("No. of uploads")

The previous code snippet is the standard plotting code, which results in the following screenshot:

Plotting data using Matplotlib: Part 1

From this graph we can see that in 2008, the main part of Debian packages uploads came from European contributors. In fact, uploads were made mainly in the evening hours (European time), after the working days are over (as we can expect from a voluntary project).

Plotting data from the Web

Often, the information we need is not distributed in an easy-to-use format such as XML or a database export but for example only on web sites.

More and more often we find interesting data on a web page, and in that case we have to parse it to extract that information: this is called web scraping .

In this example, we will parse a Wikipedia article to extracts some data to plot. The article is at http://it.wikipedia.org/wiki/Demografia_d’Italia and contains lots of information about Italian demography (it’s in Italian because the English version lacks a lot of data); in particular, we are interested in the population evolution over the years.

Probably the best known Python module for web scraping is BeautifulSoup ( http://www.crummy.com/software/BeautifulSoup/). It’s a really nice library that gets the job done quickly, but there are situations (in particular with JavaScript embedded in the web page, such as for Wikipedia) that prevent it from working.

As an alternative, we find lxml quite productive (http://codespeak.net/lxml/). It’s a library mainly used to work with XML (as the name suggests), but it can also be used with HTML (given their quite similar structures), and it is powerful and easy–to-use.

Let’s dig into the code now:

# to get the web pages
import urllib2
# lxml submodule for html parsing
from lxml.html import parse
# regular expression module
import re
# Matplotlib module
import matplotlib.pyplot as plt

Along with the Matplotlib module, we need the following modules:

  • urllib2: This is the module (from the standard library) that is used to access resources through URL (we will download the webpage with this).
  • lxml: This is the parsing library.
  • re: Regular expressions are needed to parse the returned data to extract the information we need. re is a module from the standard library, so we don’t need to install a third-party module to use it.
# general urllib2 config
user_agent = 'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
url = "http://it.wikipedia.org/wiki/Demografia_d'Italia"

Here, we prepare some configuration for urllib2, in particular, the user_agent header is used to access Wikipedia and the URL of the page.

# prepare the request and open the url
req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req)

Then we make a request for the URL and get the HTML back.

# we parse the webpage, getroot() return the document root
doc = parse(response).getroot()

We parse the HTML using the parse() function of lxml.html and then we get the root element. XML can be seen as a tree, with a root element (the node at the top of the tree from where every other node descends), and a hierarchical structure of elements.

# find the data table, using css elements
table = doc.cssselect('table.wikitable')[0]

We leverage the structure of HTML accessing the first element of type table of class wikitable because that’s the table we’re interested in.

# prepare data structures, will contain actual data
years = []
people = []

Preparing the lists that will contain the parsed data.

# iterate over the rows of the table, except first and last ones
for row in table.cssselect('tr')[1:-1]:

We can start parsing the table. Since there is a header and a footer in the table, we skip the first and the last line from the lines (selected by the tr tag) to loop over.

# get the row cell (we will use only the first two)
data = row.cssselect('td')

We get the element with the td tag that stands for table data: those are the cells in an HTML table.

# the first cell is the year
tmp_years = data[0].text_content()
# cleanup for cases like 'YYYY[N]' (date + footnote link)
tmp_years = re.sub('[.]', '', tmp_years)

We take the first cell that contains the year, but we need to remove the additional characters (used by Wikipedia to link to footnotes).

# the second cell is the population count
tmp_people = data[1].text_content()
# cleanup from '.', used as separator
tmp_people = tmp_people.replace('.', '')

We also take the second cell that contains the population for a given year. It’s quite common in Italy to separate thousands in number with a ‘.’ character: we have to remove them to have an appropriate value.

# append current data to data lists, converting to integers
years.append(int(tmp_years))
people.append(int(tmp_people))

We append the parsed values to the data lists, explicitly converting them to integer values.

# plot data
plt.plot(years,people)
# ticks every 10 years
plt.xticks(range(min(years), max(years), 10))
plt.grid()
# add a note for 2001 Census
plt.annotate("2001 Census", xy=(2001, people[years.index(2001)]),
xytext=(1986, 54.5*10**6),
arrowprops=dict(arrowstyle='fancy'))

Running the example results in the following screenshot that clearly shows why the annotation is needed:

Plotting data using Matplotlib: Part 1

In 2001, we had a national census in Italy, and that’s the reason for the drop in that year: the values released from the National Institute for Statistics (and reported in the Wikipedia article) are just an estimation of the population. However, with a census, we have a precise count of the people living in Italy.

Plotting data by parsing an Apache log file

Plotting data from a log file can be seen as the art of extracting information from it.

Every service has a log format different from the others. There are some exceptions of similar or same format (for example, for services that come from the same development teams) but then they may be customized and we’re back at the beginning.

The main differences in log files are:

  • Fields orders: Some have time information at the beginning, others in the middle of the line, and so on
  • Fields types: We can find several different data types such as integers, strings, and so on
  • Fields meanings: For example, log levels can have very different meanings

From all the data contained in the log file, we need to extract the information we are interested in from the surrounding data that we don’t need (and hence we skip).

In our example, we’re going to analyze the log file of one of the most common services: Apache. In particular, we will parse the access.log file to extract the total number of hits and amount of data transferred per day.

Apache is highly configurable, and so is the log format. Our Apache configuration, contained in the httpd.conf file, has this log format:

"%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i""

This is in LogFormat specification where

Log directive

Description

%h

The host making the request

%l

Identity of the client (which is usually not available)

%u

User making the request (usually not available)

%t

The time the request was received

%r

The request

%>s

The status code

%b

The size (in bytes) of the response sent to the client (excluding the headers)

%{Referer}i

The page from where the requests originated (for example, the HTML page where a PNG image is requested)

%{User-Agent}i

The user agent used to make the request

LEAVE A REPLY

Please enter your comment!
Please enter your name here