[box type="note" align="" class="" width=""]This article is an excerpt from the book Python Data Analysis – Second Edition written by Armando Fandango. From this book, you will learn how to process and manipulate data with Python for complex data analysis and modeling. Code bundle for this article is hosted on GitHub.[/box]
The popular open source Python library pandas is named after panel data (an econometric term) and Python data analysis. In this article, we shall learn about basic pandas functionality, data structures, and operations.
The official pandas documentation insists on naming the project pandas, in all lowercase letters. The other convention the pandas project insists on is the import pandas as pd import statement.
We will follow these conventions in this text.
In this tutorial, we will install and explore pandas. We will also acquaint ourselves with a central pandas data structure: the DataFrame.
The minimal set of dependencies required for pandas is as follows:
NumPy: This is the fundamental numerical array package that we installed and covered extensively in the preceding chapters
python-dateutil: This is a date handling library
pytz: This handles time zone definitions
This list is the bare minimum; a longer list of optional dependencies can be found at http://pandas.pydata.org/pandas-docs/stable/install.html. We can install pandas via PyPI with pip or easy_install, using a binary installer, with the aid of our operating system's package manager, or from source by checking out the code. The binary installers can be downloaded from http://pandas.pydata.org/getpandas.html.
The command to install pandas with pip is as follows:
$ pip3 install pandas rpy2
rpy2 is an interface to R; we install it here because the older rpy package is deprecated. You may have to prepend the preceding command with sudo if your user account doesn't have sufficient rights.
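Once the installation finishes, a quick way to confirm that pandas and its minimal dependencies are importable is to print their version strings (the exact versions will of course vary on your system):

```python
import pandas as pd
import numpy as np
import dateutil
import pytz

# Print the versions of pandas and its core dependencies
print("pandas", pd.__version__)
print("NumPy", np.__version__)
print("dateutil", dateutil.__version__)
print("pytz", pytz.__version__)
```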
A pandas DataFrame is a labeled two-dimensional data structure, similar in spirit to a worksheet in Google Sheets or Microsoft Excel, or to a relational database table. The columns of a pandas DataFrame can be of different types. A similar concept, by the way, was originally invented in the R programming language (for more information, refer to http://www.r-tutor.com/r-introduction/data-frame). A DataFrame can be created in several ways, for instance from a NumPy array, a Python dict, or a CSV file; in this example, we will load one from a CSV file.
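Before turning to the CSV file, here is a minimal sketch of building a DataFrame directly from a Python dict of columns; the country names and IDs below are illustrative values only:

```python
import pandas as pd

# Each dict key becomes a column label; each list becomes a column
frame = pd.DataFrame({'Country': ['Afghanistan', 'Albania'],
                      'CountryID': [1, 2]})
print(frame)
```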
As an example, we will use data that can be retrieved from http://www.exploredata.net/Downloads/WHO-Data-Set. The original data file is quite large and has many columns, so we will use an edited file instead, which only contains the first nine columns and is called WHO_first9cols.csv; the file is in the code bundle of this book. These are the first two lines, including the header:
Country,CountryID,Continent,Adolescent fertility rate (%),Adult literacy rate (%),Gross national income per capita (PPP international $),Net primary school enrolment ratio female (%),Net primary school enrolment ratio male (%),Population (in thousands) total
Afghanistan,1,1,151,28,,,,26088
In the next steps, we will take a look at pandas DataFrames and their attributes:
from pandas.io.parsers import read_csv
df = read_csv("WHO_first9cols.csv")
print("Dataframe", df)
The printout is a summary of the DataFrame. It is too long to be displayed entirely, so we will just grab the last few lines:
199 21732.0
200 11696.0
201 13228.0
[202 rows x 9 columns]
print("Shape", df.shape)
print("Length", len(df))
The values we obtain comply with the printout of the preceding step:
Shape (202, 9)
Length 202
print("Column Headers", df.columns)
print("Data types", df.dtypes)
We receive the column headers in a special data structure:
Column Headers Index([u'Country', u'CountryID', u'Continent', u'Adolescent fertility rate (%)', u'Adult literacy rate (%)', u'Gross national income per capita (PPP international $)', u'Net primary school enrolment ratio female (%)',
u'Net primary school enrolment ratio male (%)', u'Population (in thousands) total'], dtype='object')
The data types are printed as follows:
The pandas DataFrame has an index, which is like the primary key of a relational database table. We can either specify the index or have pandas create it automatically. The index can be accessed with a corresponding property, as follows:
print("Index", df.index)
An index helps us search for items quickly, just like the index in this book. In our case, the index is a wrapper around an array starting at 0, with an increment of one for each row:
print("Values", df.values)
Please note that some values are designated NaN ('not a number') in the output. These values come from empty fields in the input data file.
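To get a feel for how these missing values behave, here is a small self-contained sketch with a hypothetical column holding one NaN, mimicking an empty CSV field; isnull() flags missing entries and summing the flags counts them:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value, as an empty CSV field would produce
df = pd.DataFrame({'Adult literacy rate (%)': [28.0, np.nan, 99.0]})

# Count the NaN entries in the column
print(df['Adult literacy rate (%)'].isnull().sum())
```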
The preceding code is available in Python Notebook ch-03.ipynb, available in the code bundle of this book.
Since a pandas DataFrame is structured in a similar way to a relational database, we can view operations that read data from a DataFrame as a query. In this example, we will retrieve the annual sunspot data from Quandl. We can either use the Quandl API or download the data manually as a CSV file from
http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. If you want to install the API, you can do so by downloading installers from https://pypi.python.org/pypi/Quandl or by running the following command:
$ pip3 install Quandl
Using the API is free, but is limited to 50 API calls per day. If you require more API calls, you will have to request an authentication key. The code in this tutorial does not use a key. It should be simple to change the code to either use a key or read a downloaded CSV file. If you have difficulties, search through the Python docs at https://docs.python.org/2/.
Without further preamble, let’s take a look at how to query data in a pandas DataFrame:
import quandl

# Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual
# PyPI URL: https://pypi.python.org/pypi/Quandl
sunspots = quandl.get("SIDC/SUNSPOTS_A")
print("Head 2", sunspots.head(2))
print("Tail 2", sunspots.tail(2))
This gives us the first two and last two rows of the sunspot data (for the sake of brevity we have not shown all the columns here; your output will have all the columns from the dataset):
Head 2 Number
Year
1700-12-31 5
1701-12-31 11
[2 rows x 1 columns]
Tail 2 Number
Year
2012-12-31 57.7
2013-12-31 64.9
[2 rows x 1 columns]
Please note that we only have one column holding the number of sunspots per year. The dates are a part of the DataFrame index.
last_date = sunspots.index[-1]
print("Last value", sunspots.loc[last_date])
You can check the following output with the result from the previous step:
Last value Number 64.9
Name: 2013-12-31 00:00:00, dtype: float64
print("Values slice by date:\n", sunspots["20020101": "20131231"])
This gives the records from 2002 through to 2013:
Values slice by date Number Year
2002-12-31 104.0 [TRUNCATED]
2013-12-31 64.9
[12 rows x 1 columns]
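This kind of string-based date slicing works on any DatetimeIndex, not just the Quandl data. A minimal stand-alone sketch with three hypothetical year-end values:

```python
import pandas as pd

# A small stand-in for the sunspot data, indexed by year-end dates
s = pd.DataFrame({'Number': [104.0, 63.7, 40.4]},
                 index=pd.to_datetime(['2002-12-31', '2003-12-31',
                                       '2004-12-31']))

# Slice rows whose dates fall between the two date strings (inclusive)
print(s["20020101": "20031231"])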
print("Slice from a list of indices:\n", sunspots.iloc[[2, 4, -4, -2]])
The preceding code selects the following rows:
Slice from a list of indices Number Year
1702-12-31 16.0
1704-12-31 36.0
2010-12-31 16.0
2012-12-31 57.7
[4 rows x 1 columns]
print("Scalar with Iloc:", sunspots.iloc[0, 0])
print("Scalar with iat", sunspots.iat[1, 0])
This gives us the first and second values of the dataset as scalars:
Scalar with Iloc 5.0
Scalar with iat 11.0
print("Boolean selection", sunspots[sunspots > sunspots.mean()])
print("Boolean selection with column label:\n", sunspots[sunspots['Number of Observations'] > sunspots['Number of Observations'].mean()])
The notable difference is that the first query yields all the rows, with the rows that do not satisfy the condition filled with NaN values. The second query returns only the rows where the value is larger than the mean:
Boolean selection Number Year
1700-12-31 NaN [TRUNCATED]
1759-12-31 54.0
...
[314 rows x 1 columns]
Boolean selection with column label Number
Year
1705-12-31 58.0 [TRUNCATED]
1870-12-31 139.1
...
[127 rows x 1 columns]
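The same two behaviors can be reproduced on a tiny hypothetical DataFrame, which makes the difference easier to see at a glance:

```python
import pandas as pd

# Hypothetical single-column data; the mean of [1, 5, 9] is 5.0
s = pd.DataFrame({'Number': [1.0, 5.0, 9.0]})

# Whole-frame mask: same shape, NaN where the condition fails
masked = s[s > s.mean()]

# Column-label condition: only the rows that satisfy it survive
filtered = s[s['Number'] > s['Number'].mean()]

print(masked)
print(filtered)
```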
The preceding example code is in the ch-03.ipynb file of this book's code bundle.
Data aggregation is a term used in the field of relational databases. In a database query, we can group data by the value in a column or columns, and then perform various operations on each of these groups. The pandas DataFrame has similar capabilities. We will generate data held in a Python dict, use this data to create a pandas DataFrame, and then practice the pandas aggregation features. The dict will have the following columns:
Weather (a string)
Food (also a string)
Price (a random float)
Number (a random integer between one and nine)
The use case is that we have the results of some sort of consumer-purchase research combined with weather and market pricing, where we calculate the average of prices and keep track of the sample size and parameters:
import pandas as pd
from numpy.random import seed
from numpy.random import rand
from numpy.random import randint
import numpy as np
seed(42)
df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold', 'hot',
                                'cold', 'hot', 'cold'],
                   'Food' : ['soup', 'soup', 'icecream', 'chocolate',
                             'icecream', 'icecream', 'soup'],
                   'Price' : 10 * rand(7),
                   'Number' : randint(1, 10, size=7)})
print(df)
You should get an output similar to the following:
Please note that in older pandas versions, the column labels came from the lexically ordered keys of the Python dict; recent versions preserve the dict's insertion order instead. Lexical, or lexicographical, order is based on the alphabetic order of characters in a string.
weather_group = df.groupby('Weather')
i = 0

for name, group in weather_group:
    i = i + 1
    print("Group", i, name)
    print(group)
We have two types of weather, hot and cold, so we get two groups:
print("Weather group first\n", weather_group.first())
print("Weather group last\n", weather_group.last())
print("Weather group mean\n", weather_group.mean())
The preceding code snippet prints the first row, last row, and mean of each group:
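Since the original printout is not reproduced here, the following minimal sketch with hypothetical weather and price values shows what first(), last(), and mean() produce per group:

```python
import pandas as pd

# Hypothetical data: two groups ('cold' and 'hot') with two prices each
d = pd.DataFrame({'Weather': ['cold', 'hot', 'cold', 'hot'],
                  'Price': [2.0, 4.0, 6.0, 8.0]})
g = d.groupby('Weather')['Price']

print(g.first())   # first price seen in each group
print(g.last())    # last price seen in each group
print(g.mean())    # average price per group
```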
wf_group = df.groupby(['Weather', 'Food'])
print("WF Groups", wf_group.groups)
For each possible combination of weather and food values, a new group is created. The membership of each row is indicated by their index values as follows:
WF Groups {('hot', 'chocolate'): [3], ('cold', 'icecream'): [2, 4], ('hot', 'icecream'): [5], ('hot', 'soup'): [1], ('cold', 'soup'): [0, 6]}
Apply a list of NumPy functions on groups with the agg() method:
print("WF Aggregated\n", wf_group.agg([np.mean, np.median]))
Obviously, we could apply even more functions, but it would look messier than the following output:
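As the aggregated output is not reproduced here, a small self-contained sketch with hypothetical values illustrates the shape of the result; note that recent pandas versions prefer the string names 'mean' and 'median' over passing the NumPy functions directly:

```python
import pandas as pd

# Hypothetical data: 'cold' has two prices, 'hot' has one
d = pd.DataFrame({'Weather': ['cold', 'cold', 'hot'],
                  'Price': [1.0, 3.0, 5.0]})

# One output column per aggregation function, one row per group
agg = d.groupby('Weather')['Price'].agg(['mean', 'median'])
print(agg)
```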
The pandas DataFrame allows operations that are similar to the inner and outer joins of database tables. We can append and concatenate rows as well. To practice appending and concatenating of rows, we will reuse the DataFrame from the previous section. Let’s select the first three rows:
print("df :3\n", df[:3])
Check that these are indeed the first three rows:
df :3 Food Number Price Weather
0 soup 8 3.745401 cold
1 soup 5 9.507143 hot
2 icecream 4 7.319939 cold
The concat() function concatenates DataFrames. For example, we can concatenate a DataFrame that consists of three rows to the rest of the rows, in order to recreate the original DataFrame:
print("Concat Back together\n", pd.concat([df[:3], df[3:]]))
The concatenation output appears as follows:
Concat Back together
Food Number Price Weather
0 soup 8 3.745401 cold
1 soup 5 9.507143 hot
2 icecream 4 7.319939 cold
3 chocolate 8 5.986585 hot
4 icecream 8 1.560186 cold
5 icecream 3 1.559945 hot
6 soup 6 0.580836 cold
[7 rows x 4 columns]
To append rows, use the append() function:
print("Appending rows\n", df[:3].append(df[5:]))
The result is a DataFrame with the first three rows of the original DataFrame and the last two rows appended to it:
Appending rows
Food Number Price Weather
0 soup 8 3.745401 cold
1 soup 5 9.507143 hot
2 icecream 4 7.319939 cold
5 icecream 3 1.559945 hot
6 soup 6 0.580836 cold
[5 rows x 4 columns]
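Note that DataFrame.append() was deprecated and then removed in pandas 2.0; pd.concat() achieves the same result in current versions. A minimal sketch with hypothetical single-column frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3]})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping
# the original (possibly duplicated) index labels
combined = pd.concat([a, b], ignore_index=True)
print(combined)
```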
To demonstrate joining, we will use two CSV files, dest.csv and tips.csv. The use case behind it is that we are running a taxi company. Every time a passenger is dropped off at his or her destination, we add a row to the dest.csv file with the employee number of the driver and the destination:
EmpNr,Dest
5,The Hague
3,Amsterdam
9,Rotterdam
Sometimes drivers get a tip, so we want that registered in the tips.csv file (if this doesn’t seem realistic, please feel free to come up with your own story):
EmpNr,Amount
5,10
9,5
7,2.5
Database-like joins in pandas can be done with either the merge() function or the join() DataFrame method. The join() method joins on indices by default, which might not be what you want. In SQL, a relational database query language, we have the inner join, left outer join, right outer join, and full outer join.
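A minimal sketch of merge() on the two tiny tables above, built here as in-memory DataFrames with the same values as the CSV excerpts; merge() defaults to an inner join on the shared column, while how='outer' keeps unmatched rows from both sides (filled with NaN):

```python
import pandas as pd

dests = pd.DataFrame({'EmpNr': [5, 3, 9],
                      'Dest': ['The Hague', 'Amsterdam', 'Rotterdam']})
tips = pd.DataFrame({'EmpNr': [5, 9, 7],
                     'Amount': [10.0, 5.0, 2.5]})

# Inner join: only employee numbers present in both tables (5 and 9)
inner = pd.merge(dests, tips, on='EmpNr')

# Full outer join: all employee numbers, NaN where a side has no match
outer = pd.merge(dests, tips, on='EmpNr', how='outer')

print(inner)
print(outer)
```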
We learned how to perform various data manipulation techniques, such as aggregating, concatenating, appending, cleaning, and handling missing values, with pandas.
If you found this post useful, check out the book Python Data Analysis – Second Edition to learn advanced topics such as signal processing, textual data analysis, machine learning, and more.