[box type="note" align="" class="" width=""]This article is an excerpt from the book Python Data Analysis – Second Edition written by Armando Fandango. From this book, you will learn how to process and manipulate data with Python for complex data analysis and modeling. Code bundle for this article is hosted on GitHub.[/box]
The popular open source Python library pandas is named after panel data (an econometric term) and Python data analysis. In this article, we shall learn about basic pandas functionality, data structures, and operations.
The official pandas documentation insists on naming the project pandas, in all lowercase letters. The other convention the pandas project insists on is the import pandas as pd import statement.
We will follow these conventions in this text.
In this tutorial, we will install and explore pandas. We will also acquaint ourselves with a central pandas data structure: the DataFrame.
The minimal set of dependencies required for pandas is as follows:
NumPy: This is the fundamental numerical array package that we installed and covered extensively in the preceding chapters
python-dateutil: This is a date handling library
pytz: This handles time zone definitions
This list is the bare minimum; a longer list of optional dependencies can be found at http://pandas.pydata.org/pandas-docs/stable/install.html. We can install pandas via PyPI with pip or easy_install, using a binary installer, with the aid of our operating system's package manager, or from source by checking out the code. The binary installers can be downloaded from http://pandas.pydata.org/getpandas.html.
The command to install pandas with pip is as follows:
$ pip3 install pandas rpy2
rpy2 is an interface to R; we install it here because the older rpy package is deprecated. You may have to prepend the preceding command with sudo if your user account doesn't have sufficient rights.
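Once the installation finishes, a quick way to confirm that pandas and its minimal dependencies are importable is to print their version strings (the exact versions will of course vary on your system):

```python
import pandas as pd
import numpy as np
import dateutil
import pytz

# Print the versions of pandas and its core dependencies
print("pandas", pd.__version__)
print("NumPy", np.__version__)
print("dateutil", dateutil.__version__)
print("pytz", pytz.__version__)
```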
A pandas DataFrame is a labeled two-dimensional data structure, similar in spirit to a worksheet in Google Sheets or Microsoft Excel, or to a relational database table. The columns of a pandas DataFrame can be of different types. A similar concept, by the way, was originally invented in the R programming language (for more information, refer to http://www.r-tutor.com/r-introduction/data-frame). A DataFrame can be created in several ways, for instance from a NumPy array, a Python dict, or a CSV file; in this example, we will load one from a CSV file.
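Before turning to the CSV file, here is a minimal sketch of building a DataFrame directly from a Python dict of columns; the country names and IDs below are illustrative values only:

```python
import pandas as pd

# Each dict key becomes a column label; each list becomes a column
frame = pd.DataFrame({'Country': ['Afghanistan', 'Albania'],
                      'CountryID': [1, 2]})
print(frame)
```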
As an example, we will use data that can be retrieved from http://www.exploredata.net/Downloads/WHO-Data-Set. The original data file is quite large and has many columns, so we will use an edited file instead, which only contains the first nine columns and is called WHO_first9cols.csv; the file is in the code bundle of this book. These are the first two lines, including the header:
Country,CountryID,Continent,Adolescent fertility rate (%),Adult literacy rate (%),Gross national income per capita (PPP international $),Net primary school enrolment ratio female (%),Net primary school enrolment ratio male (%),Population (in thousands) total
Afghanistan,1,1,151,28,,,,26088
In the next steps, we will take a look at pandas DataFrames and their attributes:
from pandas.io.parsers import read_csv
df = read_csv("WHO_first9cols.csv")
print("Dataframe", df)
The printout is a summary of the DataFrame. It is too long to be displayed entirely, so we will just grab the last few lines:
199 21732.0
200 11696.0
201 13228.0
[202 rows x 9 columns]
print("Shape", df.shape)
print("Length", len(df))
The values we obtain comply with the printout of the preceding step:
Shape (202, 9)
Length 202
print("Column Headers", df.columns)
print("Data types", df.dtypes)
We receive the column headers in a special data structure:
Column Headers Index([u'Country', u'CountryID', u'Continent', u'Adolescent fertility rate (%)', u'Adult literacy rate (%)', u'Gross national income per capita (PPP international $)', u'Net primary school enrolment ratio female (%)',
u'Net primary school enrolment ratio male (%)', u'Population (in thousands) total'], dtype='object')
The data types are printed as follows:
The pandas DataFrame has an index, which is like the primary key of a relational database table. We can either specify the index or have pandas create it automatically. The index can be accessed with a corresponding property, as follows:
print("Index", df.index)
An index helps us search for items quickly, just like the index in this book. In our case, the index is a wrapper around an array starting at 0, with an increment of one for each row:
print("Values", df.values)
Please note that some values are designated NaN ('not a number') in the output. These values come from empty fields in the input data file.
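To get a feel for how these missing values behave, here is a small self-contained sketch with a hypothetical column holding one NaN, mimicking an empty CSV field; isnull() flags missing entries and summing the flags counts them:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value, as an empty CSV field would produce
df = pd.DataFrame({'Adult literacy rate (%)': [28.0, np.nan, 99.0]})

# Count the NaN entries in the column
print(df['Adult literacy rate (%)'].isnull().sum())
```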
The preceding code is available in Python Notebook ch-03.ipynb, available in the code bundle of this book.
Since a pandas DataFrame is structured in a similar way to a relational database, we can view operations that read data from a DataFrame as a query. In this example, we will retrieve the annual sunspot data from Quandl. We can either use the Quandl API or download the data manually as a CSV file from
http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual. If you want to install the API, you can do so by downloading installers from https://pypi.python.org/pypi/Quandl or by running the following command:
$ pip3 install Quandl
Using the API is free, but is limited to 50 API calls per day. If you require more API calls, you will have to request an authentication key. The code in this tutorial does not use a key. It should be simple to change the code to either use a key or read a downloaded CSV file. If you have difficulties, search through the Python docs at https://docs.python.org/2/.
Without further preamble, let’s take a look at how to query data in a pandas DataFrame:
import quandl

# Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual
# PyPI URL: https://pypi.python.org/pypi/Quandl
sunspots = quandl.get("SIDC/SUNSPOTS_A")
print("Head 2", sunspots.head(2))
print("Tail 2", sunspots.tail(2))
This gives us the first two and last two rows of the sunspot data (for the sake of brevity we have not shown all the columns here; your output will have all the columns from the dataset):
Head 2 Number
Year
1700-12-31 5
1701-12-31 11
[2 rows x 1 columns]
Tail 2 Number
Year
2012-12-31 57.7
2013-12-31 64.9
[2 rows x 1 columns]
Please note that we only have one column holding the number of sunspots per year. The dates are a part of the DataFrame index.
last_date = sunspots.index[-1]
print("Last value", sunspots.loc[last_date])
You can check the following output with the result from the previous step:
Last value Number 64.9
Name: 2013-12-31 00:00:00, dtype: float64
print("Values slice by date:\n", sunspots["20020101": "20131231"])
This gives the records from 2002 through to 2013:
Values slice by date Number Year
2002-12-31 104.0 [TRUNCATED]
2013-12-31 64.9
[12 rows x 1 columns]
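This kind of string-based date slicing works on any DatetimeIndex, not just the Quandl data. A minimal stand-alone sketch with three hypothetical year-end values:

```python
import pandas as pd

# A small stand-in for the sunspot data, indexed by year-end dates
s = pd.DataFrame({'Number': [104.0, 63.7, 40.4]},
                 index=pd.to_datetime(['2002-12-31', '2003-12-31',
                                       '2004-12-31']))

# Slice rows whose dates fall between the two date strings (inclusive)
print(s["20020101": "20031231"])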
print("Slice from a list of indices:\n", sunspots.iloc[[2, 4, -4, -2]])
The preceding code selects the following rows:
Slice from a list of indices Number Year
1702-12-31 16.0
1704-12-31 36.0
2010-12-31 16.0
2012-12-31 57.7
[4 rows x 1 columns]
print("Scalar with Iloc:", sunspots.iloc[0, 0])
print("Scalar with iat", sunspots.iat[1, 0])
This gives us the first and second values of the dataset as scalars:
Scalar with Iloc 5.0
Scalar with iat 11.0
print("Boolean selection", sunspots[sunspots > sunspots.mean()])
print("Boolean selection with column label:\n", sunspots[sunspots['Number of Observations'] > sunspots['Number of Observations'].mean()])
The notable difference is that the first query yields all the rows, with the rows that do not satisfy the condition filled with NaN values. The second query returns only the rows where the value is larger than the mean:
Boolean selection Number Year
1700-12-31 NaN [TRUNCATED]
1759-12-31 54.0
...
[314 rows x 1 columns]
Boolean selection with column label Number
Year
1705-12-31 58.0 [TRUNCATED]
1870-12-31 139.1
...
[127 rows x 1 columns]
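The same two behaviors can be reproduced on a tiny hypothetical DataFrame, which makes the difference easier to see at a glance:

```python
import pandas as pd

# Hypothetical single-column data; the mean of [1, 5, 9] is 5.0
s = pd.DataFrame({'Number': [1.0, 5.0, 9.0]})

# Whole-frame mask: same shape, NaN where the condition fails
masked = s[s > s.mean()]

# Column-label condition: only the rows that satisfy it survive
filtered = s[s['Number'] > s['Number'].mean()]

print(masked)
print(filtered)
```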
The preceding example code is in the ch-03.ipynb file of this book's code bundle.
Data aggregation is a term used in the field of relational databases. In a database query, we can group data by the value in a column or columns, and then perform various operations on each of these groups. The pandas DataFrame has similar capabilities. We will generate data held in a Python dict, use this data to create a pandas DataFrame, and then practice the pandas aggregation features. The dict will have the following columns:
Weather (a string)
Food (also a string)
Price (a random float)
Number (a random integer between one and nine)
The use case is that we have the results of some sort of consumer-purchase research combined with weather and market pricing, where we calculate the average of prices and keep track of the sample size and parameters:
import pandas as pd
from numpy.random import seed
from numpy.random import rand
from numpy.random import randint
import numpy as np
seed(42)
df = pd.DataFrame({'Weather' : ['cold', 'hot', 'cold', 'hot',
                                'cold', 'hot', 'cold'],
                   'Food' : ['soup', 'soup', 'icecream', 'chocolate',
                             'icecream', 'icecream', 'soup'],
                   'Price' : 10 * rand(7),
                   'Number' : randint(1, 10, size=7)})
print(df)
You should get an output similar to the following:
Please note that in older pandas versions, the column labels came from the lexically ordered keys of the Python dict; recent versions preserve the dict's insertion order instead. Lexical, or lexicographical, order is based on the alphabetic order of characters in a string.
weather_group = df.groupby('Weather')
i = 0

for name, group in weather_group:
    i = i + 1
    print("Group", i, name)
    print(group)
We have two types of weather, hot and cold, so we get two groups:
print("Weather group first\n", weather_group.first())
print("Weather group last\n", weather_group.last())
print("Weather group mean\n", weather_group.mean())
The preceding code snippet prints the first row, last row, and mean of each group:
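Since the original printout is not reproduced here, the following minimal sketch with hypothetical weather and price values shows what first(), last(), and mean() produce per group:

```python
import pandas as pd

# Hypothetical data: two groups ('cold' and 'hot') with two prices each
d = pd.DataFrame({'Weather': ['cold', 'hot', 'cold', 'hot'],
                  'Price': [2.0, 4.0, 6.0, 8.0]})
g = d.groupby('Weather')['Price']

print(g.first())   # first price seen in each group
print(g.last())    # last price seen in each group
print(g.mean())    # average price per group
```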
wf_group = df.groupby(['Weather', 'Food'])
print("WF Groups", wf_group.groups)
For each possible combination of weather and food values, a new group is created. The membership of each row is indicated by their index values as follows:
WF Groups {('hot', 'chocolate'): [3], ('cold', 'icecream'): [2, 4], ('hot', 'icecream'): [5], ('hot', 'soup'): [1], ('cold', 'soup'): [0, 6]}
Apply a list of NumPy functions on groups with the agg() method:
print("WF Aggregated\n", wf_group.agg([np.mean, np.median]))
Obviously, we could apply even more functions, but it would look messier than the following output:
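As the aggregated output is not reproduced here, a small self-contained sketch with hypothetical values illustrates the shape of the result; note that recent pandas versions prefer the string names 'mean' and 'median' over passing the NumPy functions directly:

```python
import pandas as pd

# Hypothetical data: 'cold' has two prices, 'hot' has one
d = pd.DataFrame({'Weather': ['cold', 'cold', 'hot'],
                  'Price': [1.0, 3.0, 5.0]})

# One output column per aggregation function, one row per group
agg = d.groupby('Weather')['Price'].agg(['mean', 'median'])
print(agg)
```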
The pandas DataFrame allows operations that are similar to the inner and outer joins of database tables. We can append and concatenate rows as well. To practice appending and concatenating of rows, we will reuse the DataFrame from the previous section. Let’s select the first three rows:
print("df :3\n", df[:3])
Check that these are indeed the first three rows:
df :3 Food Number Price Weather
0 soup 8 3.745401 cold
1 soup 5 9.507143 hot
2 icecream 4 7.319939 cold
The concat() function concatenates DataFrames. For example, we can concatenate a DataFrame that consists of three rows to the rest of the rows, in order to recreate the original DataFrame:
print("Concat Back together\n", pd.concat([df[:3], df[3:]]))
The concatenation output appears as follows:
Concat Back together
Food Number Price Weather
0 soup 8 3.745401 cold
1 soup 5 9.507143 hot
2 icecream 4 7.319939 cold
3 chocolate 8 5.986585 hot
4 icecream 8 1.560186 cold
5 icecream 3 1.559945 hot
6 soup 6 0.580836 cold
[7 rows x 4 columns]
To append rows, use the append() function:
print("Appending rows\n", df[:3].append(df[5:]))
The result is a DataFrame with the first three rows of the original DataFrame and the last two rows appended to it:
Appending rows
Food Number Price Weather
0 soup 8 3.745401 cold
1 soup 5 9.507143 hot
2 icecream 4 7.319939 cold
5 icecream 3 1.559945 hot
6 soup 6 0.580836 cold
[5 rows x 4 columns]
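Note that DataFrame.append() was deprecated and then removed in pandas 2.0; pd.concat() achieves the same result in current versions. A minimal sketch with hypothetical single-column frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3]})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping
# the original (possibly duplicated) index labels
combined = pd.concat([a, b], ignore_index=True)
print(combined)
```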
To demonstrate joining, we will use two CSV files, dest.csv and tips.csv. The use case behind it is that we are running a taxi company. Every time a passenger is dropped off at his or her destination, we add a row to the dest.csv file with the employee number of the driver and the destination:
EmpNr,Dest
5,The Hague
3,Amsterdam
9,Rotterdam
Sometimes drivers get a tip, so we want that registered in the tips.csv file (if this doesn’t seem realistic, please feel free to come up with your own story):
EmpNr,Amount
5,10
9,5
7,2.5
Database-like joins in pandas can be done with either the merge() function or the join() DataFrame method. The join() method joins on indices by default, which might not be what you want. In SQL, a relational database query language, we have the inner join, left outer join, right outer join, and full outer join.
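A minimal sketch of merge() on the two tiny tables above, built here as in-memory DataFrames with the same values as the CSV excerpts; merge() defaults to an inner join on the shared column, while how='outer' keeps unmatched rows from both sides (filled with NaN):

```python
import pandas as pd

dests = pd.DataFrame({'EmpNr': [5, 3, 9],
                      'Dest': ['The Hague', 'Amsterdam', 'Rotterdam']})
tips = pd.DataFrame({'EmpNr': [5, 9, 7],
                     'Amount': [10.0, 5.0, 2.5]})

# Inner join: only employee numbers present in both tables (5 and 9)
inner = pd.merge(dests, tips, on='EmpNr')

# Full outer join: all employee numbers, NaN where a side has no match
outer = pd.merge(dests, tips, on='EmpNr', how='outer')

print(inner)
print(outer)
```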
We learned how to perform various data manipulation techniques, such as aggregating, concatenating, appending, cleaning, and handling missing values, with pandas.
If you found this post useful, check out the book Python Data Analysis – Second Edition to learn advanced topics such as signal processing, textual data analysis, machine learning, and more.