In this article by Dan Toomey, author of the book Learning Jupyter, we will see data access in Jupyter with Python and the effect of pandas on Jupyter. We will also see Python graphics and lastly Python random numbers.
Python data access in Jupyter
I started a view using Python Data Access as the name. We will read in a dataset and compute some standard statistics on the data. We are interested in seeing how data access works in Jupyter, how well the script performs, and what information is stored in the notebook file (especially for a larger dataset).
Our script accesses the iris dataset built into one of the Python packages. All we are looking to do is read in a modest number of items and compute some basic statistics on the dataset. We are really interested in seeing how much of the data is cached in the IPYNB file.
The Python code is:
# import the datasets package
from sklearn import datasets
# pull in the iris data
iris_dataset = datasets.load_iris()
# grab the first two columns of data
X = iris_dataset.data[:, :2]
# calculate some basic statistics
x_count = len(X.flat)
x_min = X[:, 0].min() - .5
x_max = X[:, 0].max() + .5
x_mean = X[:, 0].mean()
# display our results
x_count, x_min, x_max, x_mean
I broke these steps into a couple of cells in Jupyter, as shown in the following screenshot:
Now, run the cells (using Cell | Run All) and you get this display below. The only difference is the last Out line where our values are displayed.
It seemed to take longer to load the library (the first time I ran the script) than to read the data and calculate the statistics.
If we look in the IPYNB file for this notebook, we see that none of the data is cached in the file. We simply have code references to the library, our code, and the output from the last time we ran the script:
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(300, 3.7999999999999998, 8.4000000000000004, 5.8433333333333337)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate some basic statistics\n",
"x_count = len(X.flat)\n",
"x_min = X[:, 0].min() - .5\n",
"x_max = X[:, 0].max() + .5\n",
"x_mean = X[:, 0].mean()\n",
"\n",
"# display our results\n",
"x_count, x_min, x_max, x_mean"
]
}
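Since an IPYNB file is plain JSON, one way to check what it stores is to read it with Python's standard json module. The following is only a sketch under stated assumptions: the notebook text is a hand-built stand-in modeled on the cell shown above (with the output values shortened), and the summarize_outputs helper is a hypothetical name introduced for illustration.

```python
import json

# Hypothetical, hand-built stand-in for a saved notebook, modeled on the cell above.
# The output values are shortened; this is not the author's actual file.
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {"collapsed": False},
            "outputs": [
                {
                    "data": {"text/plain": ["(300, 3.8, 8.4, 5.84)"]},
                    "execution_count": 4,
                    "metadata": {},
                    "output_type": "execute_result",
                }
            ],
            "source": ["x_count, x_min, x_max, x_mean"],
        }
    ],
})


def summarize_outputs(nb_text):
    """Return (cell index, MIME type, stored character count) per stored output."""
    nb = json.loads(nb_text)
    summary = []
    for index, cell in enumerate(nb["cells"]):
        for output in cell.get("outputs", []):
            for mime_type, payload in output.get("data", {}).items():
                # Payloads are stored either as one string or as a list of lines.
                text = payload if isinstance(payload, str) else "".join(payload)
                summary.append((index, mime_type, len(text)))
    return summary


summary = summarize_outputs(notebook_text)
print(summary)
```

To inspect a real notebook, the same helper could be given the text read from the file on disk; a large stored dataset or image would show up as a large character count.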
Python pandas in Jupyter
One of the most widely used libraries for Python is pandas. pandas is a free, open source, third-party library of data analysis tools; it is not built into Python and has to be installed separately. In this example, we will develop a Python script that uses pandas to see if there is any effect to using it in Jupyter.
I am using the Titanic dataset from http://www.kaggle.com/c/titanic-gettingStarted/download/train.csv. I am sure the same data is available from a variety of sources.
Here is our Python script that we want to run in Jupyter:
from pandas import *
training_set = read_csv('train.csv')
training_set.head()
male = training_set[training_set.sex == 'male']
female = training_set[training_set.sex =='female']
womens_survival_rate = float(sum(female.survived))/len(female)
mens_survival_rate = float(sum(male.survived))/len(male)
The script calculates the survival rates of the passengers based on sex.
We create a new notebook, enter the script into appropriate cells, add displays of the calculated data at each point, and produce our results.
Here is our notebook laid out, with displays of the calculated data added at each cell, as shown in the following screenshot:
When I ran this script, I had two problems:
- On Windows, it is common to use backslash (“\”) to separate parts of a filename. However, Python string literals use the backslash as an escape character. So, I had to change over to use forward slash (“/”) in my CSV file path. I originally had a full path to the CSV in the above code example.
- The dataset column names are taken directly from the file and are case sensitive. In this case, I was originally using the ‘sex’ field in my script, but in the CSV file the column is named Sex. Similarly, I had to change survived to Survived.
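The backslash problem can be sidestepped in more than one way. Here is a minimal sketch using only the standard library; the C:\titanic\train.csv location is a made-up path chosen for illustration, not the one used for this article.

```python
from pathlib import PureWindowsPath

# Hypothetical Windows location of the CSV file (an assumption for illustration).
# In a normal string literal "\t" means a tab, so this path is silently corrupted:
broken = "C:\titanic\train.csv"

raw = r"C:\titanic\train.csv"        # raw string: backslashes are kept literally
escaped = "C:\\titanic\\train.csv"   # doubled backslashes give the same text
forward = "C:/titanic/train.csv"     # forward slashes, the fix described above

path = PureWindowsPath(raw)

print("\t" in broken)                    # the corrupted path contains tab characters
print(raw == escaped)                    # both spellings produce the same string
print(path.as_posix())                   # the path rewritten with forward slashes
print(PureWindowsPath(forward) == path)  # Windows paths treat both separators alike
```

Any of the three working spellings can be passed to read_csv; the forward-slash form is simply the one that needs the least typing.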
The final script and result look like the following screenshot when we run it:
I have used the head() function to display the first few lines of the dataset. It is interesting to see the amount of detail that is available for all of the passengers.
If you scroll down, you see the results as shown in the following screenshot:
We see that 74% of the women survived versus just 19% of the men. I would like to think chivalry is not dead!
The two rates do not total 100%, which is expected: each one is computed within its own group (women and men), so there is no reason for them to sum to 100%. Separately, like every other dataset I have seen, this one has missing and/or inaccurate data present.
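As a cross-check on how these rates are computed, here is a minimal sketch of the same calculation using the capitalized column names. The rows are a tiny made-up sample (not taken from the Titanic data), and using groupby with mean is just one possible way to express the calculation, not necessarily what the final notebook does.

```python
import io

import pandas as pd

# Tiny made-up sample with the same column names as train.csv (Sex, Survived).
# These rows are illustrative only and are NOT taken from the Titanic data.
sample_csv = io.StringIO(
    "Sex,Survived\n"
    "female,1\n"
    "female,1\n"
    "female,1\n"
    "female,0\n"
    "male,0\n"
    "male,0\n"
    "male,0\n"
    "male,1\n"
    "male,0\n"
)
training_set = pd.read_csv(sample_csv)

# Column names are case sensitive: training_set['Sex'] works,
# while training_set['sex'] would raise a KeyError.
rates = training_set.groupby('Sex')['Survived'].mean()
womens_survival_rate = rates['female']
mens_survival_rate = rates['male']

print(womens_survival_rate, mens_survival_rate)
```

Because each rate is a fraction within its own group, the two values are not expected to add up to 100%.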
Python graphics in Jupyter
How do Python graphics work in Jupyter?
I started another view for this named Python Graphics so as to distinguish the work.
If we were to build a sample dataset of baby names and the number of births in a year of that name, we could then plot the data.
The Python coding is simple:
import pandas
import matplotlib
%matplotlib inline
baby_name = ['Alice','Charles','Diane','Edward']
number_births = [96, 155, 66, 272]
dataset = list(zip(baby_name,number_births))
df = pandas.DataFrame(data = dataset, columns=['Name', 'Number'])
df['Number'].plot()
The steps of the script are as follows:
- Import the graphics library (and data library) that we need
- Define our data
- Convert the data into a format that allows for easy graphical display
- Plot the data
We would expect a resultant graph of the number of births by baby name.
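One thing to note is that df['Number'].plot() draws a line against the row index (0 to 3), so the names themselves do not appear on the x-axis. If a chart labeled by name is wanted, one possible variation is sketched below. It is an assumption-laden alternative rather than the script used in this article: it selects the non-interactive Agg backend so that it can run outside a notebook, where %matplotlib inline is not available.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside Jupyter

import pandas as pd

baby_name = ['Alice', 'Charles', 'Diane', 'Edward']
number_births = [96, 155, 66, 272]
df = pd.DataFrame({'Name': baby_name, 'Number': number_births})

# Index the data by name so that each bar is labeled with a baby name
# instead of a row position.
ax = df.set_index('Name')['Number'].plot(kind='bar')
ax.set_ylabel('Number of births')

# One bar per name; the bar heights are the birth counts.
bar_heights = [bar.get_height() for bar in ax.patches]
print(bar_heights)
```

Inside a notebook, the matplotlib.use line would be dropped in favor of %matplotlib inline, as in the script above.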
Taking the above script and placing it into cells of our Jupyter notebook, we get something that looks like the following screenshot:
I have broken the script into different cells for easier readability. Having different cells also allows you to develop the script easily step by step, where you can display the values computed so far to validate your results. I have done this in most of the cells by displaying the dataset and DataFrame at the bottom of those cells.
When we run this script (Cell | Run All), we see the results at each step displayed as the script progresses:
And finally we see our plot of the births as shown in the following screenshot.
I was curious what metadata was stored for this script. Looking into the IPYNB file, you can see the expected values for the code cells.
The tabular data display of the DataFrame is stored as HTML—convenient:
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\">\n",
"<th></th>\n",
"<th>Name</th>\n",
"<th>Number</th>\n",
"</tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr>\n",
"<th>0</th>\n",
"<td>Alice</td>\n",
"<td>96</td>\n",
"</tr>\n",
"<tr>\n",
"<th>1</th>\n",
"<td>Charles</td>\n",
"<td>155</td>\n",
"</tr>\n",
"<tr>\n",
"<th>2</th>\n",
"<td>Diane</td>\n",
"<td>66</td>\n",
"</tr>\n",
"<tr>\n",
"<th>3</th>\n",
"<td>Edward</td>\n",
"<td>272</td>\n",
"</tr>\n",
"</tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Number\n",
"0 Alice 96\n",
"1 Charles 155\n",
"2 Diane 66\n",
"3 Edward 272"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
The graphic output cell is stored like this:
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x47cf8f0>"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png":
"<a few hundred lines of hexcodes>
…/wc/B0RRYEH0EQAAAABJRU5ErkJggg==\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x47d8e30>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# plot the data\n",
"df['Number'].plot()\n"
]
}
],
Here the image/png entry contains a large base64-encoded string representation of the graphical image displayed on screen (I abbreviated the display in the listing shown). So, the actual generated image is stored in the notebook file itself, as part of the cell's output.
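To make the stored form concrete, the sketch below encodes a PNG the way a notebook stores it under image/png and then decodes it again. It uses only the standard library. The 1x1 image is built by hand purely as a stand-in for a real plot, and the png_chunk helper is a hypothetical name introduced for this illustration.

```python
import base64
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(kind, payload):
    """Build one PNG chunk: length, type, payload, then CRC over type + payload."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


# Hand-built 1x1 grayscale PNG standing in for a real plot (illustrative only).
# IHDR fields: width, height, bit depth, color type, compression, filter, interlace.
header = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
pixels = zlib.compress(b"\x00\x00")  # filter byte 0 followed by one black pixel
png_bytes = (PNG_SIGNATURE
             + png_chunk(b"IHDR", header)
             + png_chunk(b"IDAT", pixels)
             + png_chunk(b"IEND", b""))

# What a notebook would store under "image/png": base64 text.
stored_text = base64.b64encode(png_bytes).decode("ascii")

# Recovering the original image bytes from the stored text.
decoded = base64.b64decode(stored_text)

print(stored_text[:11])
print(decoded[:8] == PNG_SIGNATURE)
```

The decoded bytes could be written to a .png file to recover the plot without rerunning the notebook.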
Python random numbers in Jupyter
For many analyses we are interested in calculating repeatable results. However, much of the analysis relies on random numbers. In Python, you can set the seed for the random number generator with the random.seed() function to achieve repeatable results.
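A quick way to see the effect of seeding is to draw the same values twice. This is a minimal sketch; the roll_totals helper is a hypothetical name introduced here, and the seed value simply matches the one used in the script that follows.

```python
import random


def roll_totals(seed, count):
    """Return `count` two-dice totals drawn after seeding the generator."""
    random.seed(seed)
    return [random.randint(1, 6) + random.randint(1, 6) for _ in range(count)]


first_run = roll_totals(113, 10)
second_run = roll_totals(113, 10)

# The same seed reproduces exactly the same sequence of totals.
print(first_run == second_run)
```

Without the call to random.seed(), each run would start from a different generator state and the two lists would generally differ.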
In this example, we simulate rolling a pair of dice and looking at the outcome. We would expect the average total of the two dice to be 7, the midpoint of the possible totals (2 through 12).
The script we are using is this:
import pylab
import random
random.seed(113)
samples = 1000
dice = []
for i in range(samples):
    total = random.randint(1,6) + random.randint(1,6)
    dice.append(total)
pylab.hist(dice, bins= pylab.arange(1.5,12.6,1.0))
pylab.show()
Once we have the script in Jupyter and execute it, we have this result:
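I added some more statistics to the notebook. I am not sure I would have counted on such a high standard deviation. Increasing the number of samples would not reduce it, though: the standard deviation of the total of two fair dice is about 2.42, and the sample value settles toward that figure. What does shrink with more samples is the uncertainty in the estimated mean.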
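The spread of two-dice totals can be checked directly. The sketch below first computes the exact mean and standard deviation from the 36 equally likely outcomes, then simulates with the same seed as the script above. The larger sample size of 100000 is an arbitrary choice for comparison, and no particular printed values are claimed here.

```python
import itertools
import random
import statistics

# Exact values from the 36 equally likely outcomes of two fair dice.
all_totals = [a + b for a, b in itertools.product(range(1, 7), repeat=2)]
exact_mean = statistics.mean(all_totals)    # 7
exact_std = statistics.pstdev(all_totals)   # square root of 35/6, about 2.415

random.seed(113)
for samples in (1000, 100000):
    dice = [random.randint(1, 6) + random.randint(1, 6) for _ in range(samples)]
    sample_mean = statistics.mean(dice)
    sample_std = statistics.stdev(dice)
    # Standard error of the mean: the quantity that shrinks as samples grow.
    std_error = sample_std / samples ** 0.5
    print(samples, round(sample_mean, 3), round(sample_std, 3), round(std_error, 4))
```

The sample standard deviation stays near the exact value at both sizes, while the standard error of the mean falls roughly tenfold when the sample count grows a hundredfold.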
The resulting graph was opened in a new window, much as it would if you ran this script in another Python development environment.
The toolbar at the top of the graphic is extensive, allowing you to manipulate the graphic in many ways.
Summary
In this article, we walked through simple data access in Jupyter through Python. Then we saw an example of using pandas. We looked at a graphics example. Finally, we looked at an example using random numbers in a Python script.