
Working with large data sources

  • 20 min read
  • 08 Jul 2015


In this article by Duncan M. McGreggor, author of the book Mastering matplotlib, we look at the use of NumPy in the world of matplotlib and big data, the problems posed by large data sources, and possible solutions to those problems.



Most of the data that users feed into matplotlib when generating plots comes from NumPy. NumPy is one of the fastest ways of processing numerical and array-based data in Python (if not the fastest), so this makes sense. However, by default, NumPy works on data held entirely in memory. If the dataset that you want to plot is larger than the total RAM available on your system, performance is going to plummet.

In the following section, we're going to take a look at an example that illustrates this limitation. But first, let's get our notebook set up, as follows:

In [1]: import matplotlib
        matplotlib.use('nbagg')
        %matplotlib inline


Here are the modules that we are going to use:

In [2]: import glob, io, math, os
        import psutil
        import numpy as np
        import pandas as pd
        import tables as tb
        from scipy import interpolate
        from scipy.stats import burr, norm
        import matplotlib as mpl
        import matplotlib.pyplot as plt
        from IPython.display import Image


We'll use the custom style sheet that we created earlier, as follows:

In [3]: plt.style.use("../styles/superheroine-2.mplstyle")

An example problem


To keep things manageable for an in-memory example, we're going to limit our generated dataset to 100 million points by using one of SciPy's many statistical distributions, as follows:

In [4]: (c, d) = (10.8, 4.2)
        (mean, var, skew, kurt) = burr.stats(c, d, moments='mvsk')


The Burr distribution, also known as the Singh–Maddala distribution, is commonly used to model household income. Next, we'll use the burr object's method to generate a random population with our desired count, as follows:

In [5]: r = burr.rvs(c, d, size=100000000)


Creating 100 million data points in the last call took about 10 seconds on a moderately recent workstation, with the RAM usage peaking at about 2.25 GB (before the garbage collection kicked in).

Let's make sure that it's the size we expect, as follows:

In [6]: len(r)
Out[6]: 100000000


If we save this to a file, it weighs in at about three-fourths of a gigabyte:

In [7]: r.tofile("../data/points.bin")
In [8]: ls -alh ../data/points.bin
       -rw-r--r-- 1 oubiwann staff 763M Mar 20 11:35 points.bin


This actually does fit in memory on a machine with 8 GB of RAM, but generating much larger files tends to be problematic. We can reuse this one multiple times, though, to reach a size that is larger than what can fit in the system RAM.
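As a quick sanity check on that figure (this snippet is not part of the original session), NumPy can report the array's in-memory size directly; it matches the on-disk size because tofile writes the raw bytes:

# 100,000,000 float64 values at 8 bytes each is roughly 763 MiB
r.nbytes / 1024 ** 2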

Before we do this, let's take a look at what we've got by generating a smooth curve for the probability distribution, as follows:

In [9]: x = np.linspace(burr.ppf(0.0001, c, d),
                        burr.ppf(0.9999, c, d), 100)
        y = burr.pdf(x, c, d)
In [10]: (figure, axes) = plt.subplots(figsize=(20, 10))
         axes.plot(x, y, linewidth=5, alpha=0.7)
         axes.hist(r, bins=100, normed=True)
         plt.show()


The following plot is the result of the preceding code:

[Figure: the Burr PDF curve plotted over a 100-bin histogram of the 100-million-point sample]

Our plot of the Burr probability density function, along with the 100-bin histogram of the 100-million-point sample, took about 7 seconds to render. This is because NumPy handles most of the work, and we only display a limited number of visual elements. What would happen if we tried to plot all 100 million points? This can be checked with the following code:

In [11]: (figure, axes) = plt.subplots()
         axes.plot(r)
         plt.show()
formatters.py:239: FormatterWarning:
Exception in image/png formatter: Allocated too many blocks


After about 30 seconds of crunching, the preceding error was thrown—the Agg backend (a shared library) simply couldn't handle the number of artists required to render all the points.

For now, this case underscores the point stated earlier: our first plot rendered relatively quickly because, given the large number of points we are working with, we were selective about the data we chose to present.

However, let's say we have data in files that are too large to fit into memory. What do we do about this? Possible ways to address this include the following:

  • Moving the data out of memory and onto the filesystem
  • Moving the data off the filesystem and into a database


We will explore examples of these in the following section.

Big data on the filesystem


The first of the two proposed solutions for large datasets involves not burdening the system memory with data, but rather leaving it on the filesystem. There are several ways to accomplish this, but the following two methods in particular are the most common in the world of NumPy and matplotlib:

  • NumPy's memmap function: This function creates memory-mapped arrays backed by files on disk, which are useful when you wish to access small segments of large files without having to read the whole file into memory.
  • PyTables: This is a package that is used to manage hierarchical datasets. It is built on top of the HDF5 and NumPy libraries and is designed to efficiently and easily cope with extremely large amounts of data.


We will examine each in turn.

NumPy's memmap function


Let's restart the IPython kernel by going to the IPython menu at the top of the notebook page, selecting Kernel, and then clicking on Restart. When the dialog box pops up, click on Restart. Then, re-execute the first few cells of the notebook to import the required libraries and set up our style sheet.

Once the kernel is restarted, take a look at the RAM utilization on your system for a fresh Python process for the notebook:

In [4]: Image("memory-before.png")
Out[4]:


The following screenshot shows the RAM utilization for a fresh Python process:

[Screenshot: RAM utilization for a fresh Python process]

Now, let's load the array data that we previously saved to disk and recheck the memory utilization, as follows:

In [5]: data = np.fromfile("../data/points.bin")
        data_shape = data.shape
        data_len = len(data)
        data_len
Out[5]: 100000000
In [6]: Image("memory-after.png")
Out[6]:


The following screenshot shows the memory utilization after loading the array data:

[Screenshot: RAM utilization after loading the array data]

This took about five seconds to load, with memory consumption roughly equal to the size of the file. This means that if we wanted to build a sample dataset too large to fit in memory, we'd need about 11 of those files concatenated, as the following arithmetic shows:

In [7]: 8 * 1024
Out[7]: 8192
In [8]: filesize = 763
        8192 / filesize
Out[8]: 10.73656618610747


However, that assumes the entire 8 GB of RAM is available. Let's see how much memory is actually available right now, as follows:

In [9]: del data
In [10]: psutil.virtual_memory().available / 1024**2
Out[10]: 2449.1796875


That's about 2.4 GB. So, to overrun the currently available RAM, we only need a fraction of those 11 files. This is shown in the following way:

In [11]: 2449 / filesize
Out[11]: 3.2096985583224114


The preceding output means that we only need four of our original files to create a file that won't fit in the available memory. However, in the following section, we will still use 11 files to ensure that the data, if it were loaded into memory, would be much larger than the total system RAM.

How do we create this large file for demonstration purposes (knowing that in a real-life situation, the data would already exist and potentially be quite large)? We could try to use numpy.tile to create a file of the desired size (larger than memory), but this can make our system unusable for a significant period of time. Instead, let's use numpy.memmap, which treats a file on disk as an array, thus letting us work with data that is too large to fit into memory.

Let's load the data file again, but this time as a memory-mapped array, as follows:

In [12]: data = np.memmap(
           "../data/points.bin", mode="r", shape=data_shape)


Loading the array into a memmap object was very quick (compared to bringing the contents of the file into memory), taking less than a second to complete. Now, let's create a new file to write data to. This file needs to stand in for a dataset that, held in memory in its original form, would be larger than our total system RAM (stored as single bytes on disk, it is much smaller):

In [13]: big_data_shape = (data_len * 11,)
         big_data = np.memmap(
             "../data/many-points.bin", dtype="uint8",
             mode="w+", shape=big_data_shape)


The preceding code creates a 1.1 GB file (reported as 1.0G by ls), which is mapped to an array that has the shape we requested and, so far, contains only zeros:

In [14]: ls -alh ../data/many-points.bin
         -rw-r--r-- 1 oubiwann staff 1.0G Apr 2 11:35 many-points.bin
In [15]: big_data.shape
Out[15]: (1100000000,)
In [16]: big_data
Out[16]: memmap([0, 0, 0, ..., 0, 0, 0], dtype=uint8)


Now, let's fill the empty data structure with copies of the data we saved to the 763 MB file, as follows:

In [17]: for x in range(11):
             start = x * data_len
             end = (x * data_len) + data_len
             big_data[start:end] = data
         big_data
Out[17]: memmap([ 90, 71, 15, ..., 33, 244, 63], dtype=uint8)


If you check your system memory before and after, you will only see minimal changes, which confirms that we are not creating an 8 GB data structure in memory. Furthermore, filling the memory-mapped array only takes a few seconds.
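One minimal way to make that before-and-after check from inside the notebook (a sketch, not part of the original session) is to ask psutil, which we imported earlier, for the current process's resident memory:

proc = psutil.Process()            # the notebook's own process
print("resident memory: {:.0f} MiB".format(proc.memory_info().rss / 1024 ** 2))
# Run this before and after filling big_data; the difference stays small because
# the writes go to the memory-mapped file rather than to an in-memory array.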

Now, we can do some sanity checks on the resulting data and ensure that we have what we were trying to get, as follows:

In [18]: big_data_len = len(big_data)
         big_data_len
Out[18]: 1100000000
In [19]: data[100000000 - 1]
Out[19]: 63
In [20]: big_data[100000000 - 1]
Out[20]: 63


Attempting to get the next index from our original dataset will throw an error (as shown in the following code), since it didn't have that index:

In [21]: data[100000000]
-----------------------------------------------------------
IndexError               Traceback (most recent call last)
...
IndexError: index 100000000 is out of bounds ...


But our new data does have an index, as shown in the following code:

In [22]: big_data[100000000]
Out[22]: 90
And then some:
In [23]: big_data[1100000000 - 1]
Out[23]: 63


We can also plot data from a memmapped array without a significant lag time. However, note that in the following code, we will create a histogram from 1.1 billion points of data, so the plotting won't be instantaneous:

In [24]: (figure, axes) = plt.subplots(figsize=(20, 10))
         axes.hist(big_data, bins=100)
         plt.show()


The following plot is the result of the preceding code:

[Figure: 100-bin histogram of the 1.1-billion-point memory-mapped array]

The plot took about 40 seconds to generate.

The odd shape of the histogram is due to our file hacking: we radically changed the nature of the data by increasing the sample size linearly without regard for the distribution. The purpose of this demonstration wasn't to preserve a sample distribution, but rather to show how one can work with large datasets, and the result is not too shabby. Thanks to NumPy, matplotlib can work with data that is too large to fit in memory, even if it is a bit slow iterating over hundreds of millions of data points from disk.

Can matplotlib do better?

HDF5 and PyTables


A commonly used file format in the scientific computing community is Hierarchical Data Format (HDF). HDF is a set of file formats (namely HDF4 and HDF5) that were originally developed at the National Center for Supercomputing Applications (NCSA), a unit of the University of Illinois at Urbana-Champaign, to store and organize large amounts of numerical data.


The NCSA has been a great source of technical innovation in the computing industry: a Telnet client, Mosaic (one of the first graphical web browsers), a web server that evolved into the Apache HTTP Server, and HDF, which is of particular interest to us, were all developed there. It is a little-known fact that NCSA's browser code was the ancestor of both the Netscape web browser and a prototype of Internet Explorer that was provided to Microsoft by a third party.


HDF is supported by Python, R, Julia, Java, Octave, IDL, and MATLAB, to name a few. HDF5 offers significant improvements and useful simplifications over HDF4. It uses B-trees to index table objects and, as such, works well for write-once/read-many time-series data. Common use cases span fields such as meteorological studies, biosciences, finance, and aviation. HDF5 files of multi-terabyte size are common in these applications, typically constructed from the analyses of multiple HDF5 source files, thus providing a single (and often extensive) source of grouped data for a particular application.

The PyTables library is built on top of the HDF5 library and NumPy. As such, it not only provides access to one of the most widely used large-data file formats in the scientific computing community, but also links the data extracted from these files with the data types and objects provided by the fast Python numerical processing library.

PyTables is also used in other projects. Pandas wraps PyTables, thus extending its convenient in-memory data structures, functions, and objects to large on-disk files. To use HDF data with Pandas, you'll want to create a pandas.HDFStore, read from HDF data sources with pandas.read_hdf, or write to one with a DataFrame's to_hdf method. Files that are too large to fit in memory may be read and written by utilizing chunking techniques. Pandas does support disk-backed DataFrame operations, but these are not very efficient, because columns of data have to be reassembled when they are read back into memory.
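Here is a rough sketch of what that chunked workflow can look like (the store path, key, and CSV file name below are hypothetical, not taken from this notebook):

# Write: append DataFrame chunks to an HDF5 store in 'table' format, which
# supports appending and on-disk queries.
store_path = "../data/example.h5"
with pd.HDFStore(store_path, mode="w") as store:
    for chunk in pd.read_csv("../data/some-big-file.csv", chunksize=100000):
        store.append("measurements", chunk, data_columns=True)

# Read: iterate over the stored table in chunks rather than loading it whole.
row_count = 0
for chunk in pd.read_hdf(store_path, "measurements", chunksize=100000):
    row_count += len(chunk)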

One project to keep an eye on under the PyData umbrella is Blaze. It's an open wrapper and utility framework that can be used when you wish to work with large datasets and generalize actions such as creation, access, updates, and migration. Blaze supports not only HDF, but also SQL, CSV, and JSON. The API usage between Pandas and Blaze is very similar, which makes it a nice tool for developers who need to support multiple backends.

In the following example, we will use PyTables directly to create an HDF5 file that is too large to fit in memory (for a machine with 8 GB of RAM). We will follow these steps:

  • Create a series of CSV source data files that take up approximately 14 GB of disk space
  • Create an empty HDF5 file
  • Create a table in the HDF5 file and provide the schema metadata and compression options
  • Load the CSV source data into the HDF5 table
  • Query the new data source once the data has been migrated


Remember the temperature and precipitation data for St. Francis, Kansas, USA, from a previous notebook? We are going to generate random data with similar columns for the purposes of the HDF5 example. The data will be drawn from a normal distribution and used in the guise of temperature and precipitation readings for hundreds of thousands of fictitious towns across the globe over the last century, as follows:

In [25]: head = "country,town,year,month,precip,temp\n"
         row = "{},{},{},{},{},{}\n"
         filename = "../data/{}.csv"
         town_count = 1000
         (start_year, end_year) = (1894, 2014)
         (start_month, end_month) = (1, 13)
         sample_size = (1 + 2
                        * town_count * (end_year - start_year)
                        * (end_month - start_month))
         countries = range(200)
         towns = range(town_count)
         years = range(start_year, end_year)
         months = range(start_month, end_month)
         for country in countries:
             with open(filename.format(country), "w") as csvfile:
                 csvfile.write(head)
                 csvdata = ""
                 weather_data = norm.rvs(size=sample_size)
                 weather_index = 0
                 for town in towns:
                     for year in years:
                         for month in months:
                             csvdata += row.format(
                                 country, town, year, month,
                                 weather_data[weather_index],
                                 weather_data[weather_index + 1])
                             weather_index += 2
                 csvfile.write(csvdata)


Note that we generated a sample population twice as large as the number of rows in order to pull both the simulated temperature and the precipitation values at the same time (from the same set). This will take about 30 minutes to run. When complete, you will see the following files:

In [26]: ls -rtm ../data/*.csv
         ../data/0.csv, ../data/1.csv, ../data/2.csv, 
         ../data/3.csv, ../data/4.csv, ../data/5.csv, 
         ...
         ../data/194.csv, ../data/195.csv, ../data/196.csv, 
         ../data/197.csv, ../data/198.csv, ../data/199.csv


A quick look at just one of the files reveals the size of each, as follows:

In [27]: ls -lh ../data/0.csv
         -rw-r--r-- 1 oubiwann staff 72M Mar 21 19:02 ../data/0.csv


With each of the 200 files weighing in at 72 MB, the data takes up about 14 GB of disk space, which exceeds the RAM of the system in question.

Furthermore, running queries against so much data spread across .csv files isn't going to be very efficient; it's going to take a long time. So what are our options? Well, for reading this data, HDF5 is a very good fit. In fact, it is designed for jobs like this. We will use PyTables to convert the .csv files to a single HDF5 file. We'll start by creating an empty table file, as follows:

In [28]: tb_name = "../data/weather.h5t"
         h5 = tb.open_file(tb_name, "w")
         h5
Out[28]: File(filename=../data/weather.h5t, title='', mode='w', 
             root_uep='/', filters=Filters(
                 complevel=0, shuffle=False, fletcher32=False, 
                 least_significant_digit=None))
         / (RootGroup) ''


Next, we'll provide some assistance to PyTables by indicating the data types of each column in our table, as follows:

In [29]: data_types = np.dtype(
             [("country", "<i8"),
              ("town", "<i8"),
              ("year", "<i8"),
              ("month", "<i8"),
              ("precip", "<f8"),
              ("temp", "<f8")])


Also, let's define a compression filter that can be used by PyTables when saving our data, as follows:

In [30]: filters = tb.Filters(complevel=5, complib='blosc')


Now, we can create a table inside our new HDF5 file, as follows:

In [31]: tab = h5.create_table(
             "/", "weather",
             description=data_types,
             filters=filters)


With that done, let's load each CSV file, read it in chunks so that we don't overload the memory, and then append it to our new HDF5 table, as follows:

In [32]: for filename in glob.glob("../data/*.csv"):
             it = pd.read_csv(filename, iterator=True, chunksize=10000)
             for chunk in it:
                 tab.append(chunk.to_records(index=False))
                 tab.flush()


Depending on your machine, the entire process of loading the CSV file, reading it in chunks, and appending to a new HDF5 table can take anywhere from 5 to 10 minutes.

However, what started out as a collection of .csv files weighing in at 14 GB is now a single compressed 4.8 GB HDF5 file, as shown in the following code:

In [33]: h5.get_filesize()
Out[33]: 4758762819


Here's the metadata for the PyTables-wrapped HDF5 table after the data insertion:

In [34]: tab
Out[34]: /weather (Table(288000000,), shuffle, blosc(5)) ''
 description := {
 "country": Int64Col(shape=(), dflt=0, pos=0),
 "town": Int64Col(shape=(), dflt=0, pos=1),
 "year": Int64Col(shape=(), dflt=0, pos=2),
 "month": Int64Col(shape=(), dflt=0, pos=3),
 "precip": Float64Col(shape=(), dflt=0.0, pos=4),
 "temp": Float64Col(shape=(), dflt=0.0, pos=5)}
 byteorder := 'little'
 chunkshape := (1365,)


Now that we've created our file, let's use it. Let's excerpt a few lines with an array slice, as follows:

In [35]: tab[100000:100010]
Out[35]: array([(0, 69, 1947, 5, -0.2328834718674, 0.06810312195695),
         (0, 69, 1947, 6, 0.4724989007889, 1.9529216219569),
         (0, 69, 1947, 7, -1.0757216683235, 1.0415374480545),
         (0, 69, 1947, 8, -1.3700249968748, 3.0971874991576),
         (0, 69, 1947, 9, 0.27279758311253, 0.8263207523831),
         (0, 69, 1947, 10, -0.0475253104621, 1.4530808932953),
         (0, 69, 1947, 11, -0.7555493935762, -1.2665440609117),
         (0, 69, 1947, 12, 1.540049376928, 1.2338186532516),
         (0, 69, 1948, 1, 0.829743501445, -0.1562732708511),
         (0, 69, 1948, 2, 0.06924900463163, 1.187193711598)],
         dtype=[('country', '<i8'), ('town', '<i8'),
               ('year', '<i8'), ('month', '<i8'),
               ('precip', '<f8'), ('temp', '<f8')])
In [36]: tab[100000:100010]["precip"]
Out[36]: array([-0.23288347, 0.4724989 , -1.07572167,
               -1.370025 , 0.27279758, -0.04752531,
               -0.75554939, 1.54004938, 0.8297435 ,
               0.069249 ])


When we're done with the file, we do the same thing that we would do with any other file-like object:

In [37]: h5.close()


If we want to work with it again, simply load it, as follows:

In [38]: h5 = tb.open_file(tb_name, "r")
         tab = h5.root.weather


Let's try plotting the data from our HDF5 file:

In [39]: (figure, axes) = plt.subplots(figsize=(20, 10))
         axes.hist(tab[:1000000]["temp"], bins=100)
         plt.show()


Here's a plot for the first million data points:

[Figure: 100-bin histogram of the temperatures in the first million rows of the HDF5 table]

This histogram was rendered quickly, with a much better response time than what we've seen before. Hence, the process of accessing the HDF5 data is very fast. The next question might be "What about executing calculations against this data?" Unfortunately, running the following will consume an enormous amount of RAM:

tab[:]["temp"].mean()


We've just asked for all of the data, all 288 million rows: everything would be loaded into RAM, grinding the average workstation to a halt. Ideally, when you iterate through the source data and create the HDF5 file, you should also crunch the numbers that you will need, adding supplemental columns or groups to the HDF5 file that can be used later by you and your peers, as sketched below.
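Here is a minimal sketch of that idea with PyTables (not part of the original notebook; the group and array names are made up, and the read-only handle opened above is closed and reopened in append mode first):

h5.close()                               # release the read-only handle
h5 = tb.open_file(tb_name, "a")          # reopen so that we can write
tab = h5.root.weather
summary = h5.create_group("/", "summary", "Precomputed statistics")
# For illustration only: a statistic over the first million rows. In practice,
# you would store numbers crunched over the full dataset during ingestion.
h5.create_array(summary, "sample_mean_temp",
                np.array([tab[:1000000]["temp"].mean()]),
                "Mean temperature of the first million rows")
h5.flush()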

If we have data that we will mostly be selecting (extracting portions) and which has already been crunched and grouped as needed, HDF5 is a very good fit. This is why one of the most common use cases that you see for HDF5 is the sharing and distribution of the processed data.

However, if we have data that we need to process repeatedly, then we will either need a method that does not load all of the data into memory, or find a better match for our data-processing needs.

We saw in the previous section that selecting data is very fast in HDF5. What about calculating the mean for a small section of data? If we've got a total of 288 million rows, let's pick a divisor that gives us several hundred thousand rows at a time: 281,250 rows, to be precise. Let's get the mean for the first slice, as follows:

In [40]: tab[0:281250]["temp"].mean()
Out[40]: 0.0030696632864265312


This took about 1 second to calculate. What about iterating through the records in a similar fashion? Let's break the 288 million records into chunks of the same size; this will result in 1024 chunks. We'll start by building the ranges needed for an increment of 281,250, and then we'll examine the first and the last range as a sanity check, as follows:

In [41]: limit = 281250
         ranges = [(x * limit, x * limit + limit)
             for x in range(2 ** 10)]
         (ranges[0], ranges[-1])
Out[41]: ((0, 281250), (287718750, 288000000))


Now, we can use these ranges to generate the mean for each chunk of 281,250 rows of temperature data and print the total number of means that we generated to make sure that we're getting our counts right, as follows:

In [42]: means = [tab[x * limit:x * limit + limit]["temp"].mean()
             for x in range(2 ** 10)]
         len(means)
Out[42]: 1024


Depending on your machine, that should take between 30 and 60 seconds. With this work done, it's now easy to calculate the mean for all of the 288 million points of temperature data:

In [43]: sum(means) / len(means)
Out[43]: -5.3051780413782918e-05


Through HDF5's efficient file format and implementation, combined with splitting our operations into tasks that never pull the full dataset into memory, we were able to perform calculations across a significant fraction of a billion records in less than a minute. HDF5 even supports parallelization, so this can be improved upon with a little more time and effort; a rough sketch follows.
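Here is one way the per-chunk means could be spread across CPU cores (a sketch only, not from the original notebook; it assumes a fork-based multiprocessing start method, as on Linux, since the worker function is defined in the notebook itself):

from multiprocessing import Pool

def chunk_mean(bounds):
    # Each worker opens its own read-only handle; a PyTables file handle
    # should not be shared across processes.
    (start, stop) = bounds
    with tb.open_file("../data/weather.h5t", "r") as h5_worker:
        return h5_worker.root.weather[start:stop]["temp"].mean()

with Pool(4) as pool:                    # four workers; tune to your core count
    par_means = pool.map(chunk_mean, ranges)
sum(par_means) / len(par_means)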

However, there are many cases where HDF5 is not a practical choice. You may have free-form data whose preprocessing would be too expensive, or the dataset may actually be too large to fit on a single machine. This is when you may consider using matplotlib with distributed data.

Summary


In this article, we covered the role of NumPy in the world of big data and matplotlib, as well as the process of, and problems in, working with large data sources. We also discussed possible solutions to these problems using NumPy's memmap function, as well as HDF5 and PyTables.
