Analyzing a Complex Dataset

5 min read

(For more resources related to this topic, see here.)

We may need to analyze volumes of data that are too large for a simple spreadsheet. A document with more than a few hundred rows can rapidly become bewildering. Handling thousands of rows can be very challenging, indeed.

How can we do this more sophisticated analysis? The answer for many such analytic problems is Python. It handles larger sets of data with ease. We can very easily write sophisticated sorting, filtering and calculating rules. We’re not limited by the row-and-column structure, either.

We’ll look at some data that—on the surface—is boring, prosaic budget information. We’ve chosen budgets because they’re relatively simple to understand with a lot of problem domain background. Specifically, we’ll download the budget for the city of Chicago, Illinois. Why Chicago? Chicago has made their data available to the general public, so we can use this real-world data for an example that’s not contrived or simplified. A number of other municipalities have also made similar data available.

How do we proceed? Clearly, step one is to get the data so we can see what we’re working with.

For more information, we can start with this URL:

The city’s data portal offers a number of formats: CSV, JSON, PDF, RDF, RSS, XLS, XLSX, and XML. Of these, the JSON is perhaps the easiest to work with.

We can acquire the appropriation information with the following URL:

This will yield a JSON-format document. We can gather this data and cache it locally with a small Python script.

import urllib.request budget_url= "
rows.json?accessType=DOWNLOAD" with open( "budget_appropriations.json", "w") as target: with urllib.request.urlopen(budget_url) as document: target.write( )

We can use a similar script to gather the salary data that goes with the budget appropriation. The salary information will use this URL:

Clearly, we can create a more general script to download from these two slightly different URL’s to gather two very different JSON files. We’ll focus on the appropriations for this article because the data organization turns out to be simpler.

We could download this data manually using a browser. It rapidly becomes difficult to automate data collection and analysis when we introduce a manual step involving a person pointing and clicking using a browser. A similar comment applies to trying to use a spreadsheet to analyze the data: merely putting it on a computer doesn’t really automate or formalize a process.

The results of step one, then, are two files: budget_appropriations.json and budget_salaries.json.

Parsing the JSON document

Since the data is encoded in JSON, we can simply open the files in our development environment to see what they look like. Informally, it appears that two data sets have some common columns and some distinct columns. We’ll need to create a more useful side-by-side comparison of the two files.

We’ll import the JSON module. We almost always want to pretty-print during exploration, so we’ll import the pprint() function just in case we need it. Here are the first two imports:

import json from pprint import pprint

One thing we may have noted when looking at the JSON is that are two important-looking keys: a ‘meta’ key and a ‘data’ key. The ‘meta’ key is associated with a sequence of column definitions. The ‘data’ object is associated with a sequence of rows of actual data.

We can use a quick script like the following to discover the details of the metadata:

def column_extract( ): for filename in "budget_appropriations.json", "budget_salaries.json": with open(filename) as source: print( filename ) dataset= json.load( source ) for col in dataset['meta']['view']['columns']: if col['dataTypeName'] != "meta_data": print( col['fieldName'], col['dataTypeName'] ) print()

We’ve opened each of our source files, and loaded the JSON document into an internal mapping named ‘dataset‘. The metadata can be found by navigating through the dictionaries that are part of this document. The path is dataset[‘meta’][‘view’][‘columns’]. This leads to a sequence of column definitions.

For each column definition, we can print out two relevant attributes using the keys ‘fieldName‘ and ‘dataTypeName‘. This will reveal items that are dimensions and items that are facts within this big pile of data.

This small function can be used in a short script file to see the various columns involved. We can write a short script like this:

if __name__ == "__main__": column_extract()

We can see that we have columns which are any type number, text, and money. The number and text columns can be termed “dimensions”, they describe the facts. The money columns are the essential facts that we’d like to analyze.

Designing around a user story

Now that we have the data, it makes sense to see where we’re going with it. Our goal is to support queries against salaries and appropriations. Our users want to see various kinds of subtotals and some correlations between the two datasets.

The terms “various kinds” reveal that the final analysis details are open-ended. It’s most important for us to build a useful model of the data rather than solve some very specific problem. Once we have the data in a usable model, we can solve a number of specific problems.

A good approach is to follow Star Schema or Facts and Dimensions design pattern that supports data warehouse design. We’ll decompose each table into facts—measurements that have defined units—and dimensions that describe those facts. We might also call the dimensions attributes of Business Entities that are measured by the facts. The facts in a budget analysis context are almost always money. Almost everything else will be some kind of dimension: time, geography, legal organization, government service, or financial structure.

In the long run, we might like to load a relational database using Python objects. This allows a variety of tools to access the data. It would lead to more complex technology stack. For example, we’d need an Object-Relational Mapping (ORM) layer in addition to a star schema layer.

For now, we’ll populate a pure python model. We’ll show how this model can be extended to support SQLAlchemy as an ORM layer.


Please enter your comment!
Please enter your name here