In this article by Eric Rochester, author of the book Clojure Data Analysis Cookbook, Second Edition, we will cover the following recipes:

  • Loading Incanter’s sample datasets
  • Loading Clojure data structures into datasets
  • Viewing datasets interactively with view
  • Converting datasets to matrices
  • Using infix formulas in Incanter
  • Selecting columns with $
  • Selecting rows with $
  • Filtering datasets with $where
  • Grouping data with $group-by
  • Saving datasets to CSV and JSON
  • Projecting from multiple datasets with $join

Introduction

Incanter combines the power to do statistics using a fully-featured statistical language such as R (http://www.r-project.org/) with the ease and joy of Clojure.

Incanter’s core data structure is the dataset, so we’ll spend some time in this article looking at how to use datasets effectively. While learning basic tools in this manner is often not the most exciting way to spend your time, it can still be incredibly useful. At its most fundamental level, an Incanter dataset is a table of rows. Each row has the same set of columns, much like a spreadsheet. The data in each cell of an Incanter dataset can be a string or a number. However, some operations require the data to be numeric only.

First you’ll learn how to populate and view datasets, then you’ll learn different ways to query and project the parts of the dataset that you’re interested in onto a new dataset. Finally, we’ll take a look at how to save datasets and merge multiple datasets together.

Loading Incanter’s sample datasets

Incanter comes with a set of default datasets that are useful for exploring Incanter’s functions. I haven’t made use of them in this book, since there is so much data available in other places, but they’re a great way to get a feel for what you can do with Incanter. Some of these datasets (the Iris dataset, for instance) are widely used to teach and test statistical algorithms. The Iris dataset contains the species and the petal and sepal dimensions for 150 irises. This is the dataset that we’ll access today.

In this recipe, we’ll load a dataset and see what it contains.

Getting ready

We’ll need to include Incanter in our Leiningen project.clj file:

(defproject inc-dsets "0.1.0"
:dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])

We’ll also need to include the right Incanter namespaces into our script or REPL:

(use '(incanter core datasets))

How to do it…

Once the namespaces are available, we can access the datasets easily:

user=> (def iris (get-dataset :iris))
#'user/iris
user=> (col-names iris)
[:Sepal.Length :Sepal.Width :Petal.Length :Petal.Width :Species]
user=> (nrow iris)
150
user=> (set ($ :Species iris))
#{"versicolor" "virginica" "setosa"}

How it works…

We use the get-dataset function to access the built-in datasets. In this case, we’re loading Fisher’s Iris dataset, sometimes called Anderson’s dataset. This is a multivariate dataset for discriminant analysis. It gives petal and sepal measurements for 150 different irises of three different species.

Incanter’s sample datasets cover a wide variety of topics—from U.S. arrests to plant growth and ultrasonic calibration. They can be used to test different algorithms and analyses and to work with different types of data.

By the way, the names of functions should be familiar to you if you’ve previously used R. Incanter often uses the names of R’s functions instead of using the Clojure names for the same functions. For example, the preceding code sample used nrow instead of count.

There’s more…

Incanter’s API documentation for get-dataset (http://liebke.github.com/incanter/datasets-api.html#incanter.datasets/get-dataset) lists more sample datasets, and you can refer to it for the latest information about the data that Incanter bundles.

Loading Clojure data structures into datasets

While they are good for learning, Incanter’s built-in datasets probably won’t be that useful for your work (unless you work with irises). Other recipes cover ways to get data from CSV files and other sources into Incanter. Incanter also accepts native Clojure data structures in a number of formats. We’ll take a look at a couple of these in this recipe.

Getting ready

We’ll just need Incanter listed in our project.clj file:

(defproject inc-dsets "0.1.0"
:dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])

We’ll also need to include this in our script or REPL:

(use 'incanter.core)

How to do it…

The primary function used to convert data into a dataset is to-dataset. While it can convert single, scalar values into a dataset, we’ll start with slightly more complicated inputs.

  1. Generally, you’ll be working with at least a matrix. If you pass this to to-dataset, what do you get?
    user=> (def matrix-set (to-dataset [[1 2 3] [4 5 6]]))
    #'user/matrix-set
    user=> (nrow matrix-set)
    2
    user=> (col-names matrix-set)
    [:col-0 :col-1 :col-2]
  2. All the data’s here, but it can be labeled in a better way. Does to-dataset handle maps?
    user=> (def map-set (to-dataset {:a 1, :b 2, :c 3}))
    #'user/map-set
    user=> (nrow map-set)
    1
    user=> (col-names map-set)
    [:a :c :b]
  3. So, map keys become the column labels. That’s much more intuitive. Let’s throw a sequence of maps at it:
    user=> (def maps-set (to-dataset [{:a 1, :b 2, :c 3},
                                     {:a 4, :b 5, :c 6}]))
    #'user/maps-set
    user=> (nrow maps-set)
    2
    user=> (col-names maps-set)
    [:a :c :b]
  4. This is much more useful. We can also create a dataset by passing the column vector and the row matrix separately to dataset:
    user=> (def matrix-set-2
             (dataset [:a :b :c]
                             [[1 2 3] [4 5 6]]))
    #'user/matrix-set-2
    user=> (nrow matrix-set-2)
    2
    user=> (col-names matrix-set-2)
    [:a :b :c]

How it works…

The to-dataset function looks at the input and tries to process it intelligently. If given a sequence of maps, the column names are taken from the keys of the first map in the sequence.

Ultimately, it uses the dataset constructor to create the dataset. The constructor requires the data to be passed in as a vector of column names and a matrix of rows. When the data is already in this format, or when we need the most control (to rename the columns, for instance), we can call dataset directly.
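As a minimal sketch of calling dataset directly, the following builds a two-row dataset from a vector of column names and a vector of rows. The people name and its values are made up for illustration and are not part of the recipe’s data.

```clojure
(use 'incanter.core)

;; Hypothetical sample data: column names first, then one vector per row.
(def people
  (dataset [:name :age]
           [["Ada" 36]
            ["Grace" 45]]))

(nrow people)       ; number of rows
(col-names people)  ; the column names we passed in
($ :age people)     ; the values in the :age column
```

Because we supply the column names ourselves, this is also the simplest way to give meaningful labels to data that arrives as plain vectors.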

Viewing datasets interactively with view

Being able to interact with our data programmatically is important, but sometimes it’s also helpful to be able to look at it. This can be especially useful when you do data exploration.

Getting ready

We’ll need to have Incanter in our project.clj file and script or REPL, so we’ll use the same setup as we did for the Loading Incanter’s sample datasets recipe, as follows. We’ll also use the Iris dataset from that recipe.

(use '(incanter core datasets))

How to do it…

Incanter makes this very easy. Let’s take a look at just how simple it is:

  1. First, we need to load the dataset, as follows:
    user=> (def iris (get-dataset :iris))
    #'user/iris
  2. Then we just call view on the dataset:
    user=> (view iris)

This function returns the Swing window frame, which contains our data, as shown in the following screenshot. This window should also be open on your desktop, although for me, it’s usually hiding behind another window:

How it works…

Incanter’s view function takes any object and tries to display it graphically. In this case, it simply displays the raw data as a table.

Converting datasets to matrices

Although datasets are often convenient, many times we’ll want to treat our data as a matrix from linear algebra. In Incanter, matrices store a table of doubles. This provides good performance in a compact data structure. Moreover, we’ll need matrices many times because some of Incanter’s functions, such as trans, only operate on a matrix. Plus, Incanter’s matrix type implements Clojure’s ISeq interface, so interacting with matrices is also convenient.

Getting ready

For this recipe, we’ll need the Incanter libraries, so we’ll use this project.clj file:

(defproject inc-dsets "0.1.0"
:dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])

We’ll use the core and io namespaces, so we’ll load these into our script or REPL:

(use '(incanter core io))

This line binds the file name to the identifier data-file:

(def data-file "data/all_160_in_51.P35.csv")

How to do it…

For this recipe, we’ll create a dataset, convert it to a matrix, and then perform some operations on it:

  1. First, we need to read the data into a dataset, as follows:

    (def va-data (read-dataset data-file :header true))

  2. Then, in order to convert it to a matrix, we just pass it to the to-matrix function. Before we do this, we’ll pull out a few of the columns since matrices can only contain floating-point numbers:
    (def va-matrix
       (to-matrix ($ [:POP100 :HU100 :P035001] va-data)))
  3. Now that it’s a matrix, we can treat it like a sequence of rows. Here, we pass it to first in order to get the first row and to count in order to get the number of rows in the matrix; take would similarly return a subset of the rows:
    user=> (first va-matrix)
    A 1x3 matrix
    -------------
    8.19e+03 4.27e+03 2.06e+03
     
    user=> (count va-matrix)
    591
  4. We can also use Incanter’s matrix operators to get the sum of each column, for instance. The plus function takes each row and sums each column separately:
    user=> (reduce plus va-matrix)
    A 1x3 matrix
    -------------
    5.43e+06 2.26e+06 1.33e+06

How it works…

The to-matrix function takes a dataset of floating-point values and returns a compact matrix. Matrices are used by many of Incanter’s more sophisticated analysis functions, as they’re easy to work with.
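If the census file isn’t at hand, the same steps can be tried on a tiny dataset. This is only a sketch: small-ds and its values are invented for illustration, and the calls are the ones used in the recipe plus nrow and ncol.

```clojure
(use 'incanter.core)

;; Hypothetical numeric data standing in for the census columns.
(def small-ds (dataset [:x :y :z] [[1 2 3] [4 5 6]]))
(def small-matrix (to-matrix small-ds))

(nrow small-matrix)   ; rows in the matrix
(ncol small-matrix)   ; columns in the matrix
(count small-matrix)  ; also the number of rows, via the sequence interface

;; Sum each column by adding the rows together, as in step 4.
(def column-sums (reduce plus small-matrix))
```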

There’s more…

In this recipe, we saw the plus matrix operator. Incanter defines a full suite of these. You can learn more about matrices and see what operators are available at https://github.com/liebke/incanter/wiki/matrices.

Using infix formulas in Incanter

There’s a lot to like about Lisp: macros, the simple syntax, and the rapid development cycle. Most of the time, it is fine if you treat math operators as functions and use prefix notation, which is a consistent, function-first syntax. This allows you to treat math operators in the same way as everything else so that you can pass them to reduce, or anything else you want to do.

However, we’re not taught to read math expressions using prefix notation (with the operator first). And especially when formulas get even a little complicated, tracing out exactly what’s happening can get hairy.

Getting ready

For this recipe we’ll just need Incanter in our project.clj file, so we’ll use the dependencies statement—as well as the use statement—from the Loading Clojure data structures into datasets recipe.

For data, we’ll use the matrix that we created in the Converting datasets to matrices recipe.

How to do it…

Incanter has a macro that converts a standard math notation to a lisp notation. We’ll explore that in this recipe:

  1. The $= macro changes its contents to use an infix notation, which is what we’re used to from math class:
    user=> ($= 7 * 4)
    28
    user=> ($= 7 * 4 + 3)
    31
  2. We can also work on whole matrices or just parts of matrices. In this example, we perform a scalar multiplication of the matrix:
    user=> ($= va-matrix * 4)
    A 591x3 matrix
    ---------------
    3.28e+04 1.71e+04 8.22e+03
    2.08e+03 9.16e+02 4.68e+02
    1.19e+03 6.52e+02 3.08e+02
    ...
    1.41e+03 7.32e+02 3.72e+02
    1.31e+04 6.64e+03 3.49e+03
    3.02e+04 9.60e+03 6.90e+03
    user=> ($= (first va-matrix) * 4)
    A 1x3 matrix
    -------------
    3.28e+04 1.71e+04 8.22e+03
  3. Using this, we can build complex expressions, such as this expression that takes the mean of the values in the first row of the matrix:
    user=> ($= (sum (first va-matrix)) /
               (count (first va-matrix)))
    4839.333333333333
  4. Or we can build expressions that take the mean of each column, as follows:
    user=> ($= (reduce plus va-matrix) / (count va-matrix))
    A 1x3 matrix
    -------------
    9.19e+03 3.83e+03 2.25e+03

How it works…

Any time you’re working with macros and you wonder how they work, you can always get at their output expressions easily, so you can see what the computer is actually executing. The tool to do this is macroexpand-1. This expands the macro one step and returns the result. Its sibling function, macroexpand, expands the expression until there is no macro expression left. Usually, this is more than we want, so we just use macroexpand-1.

Let’s see what these macros expand into:

user=> (macroexpand-1 '($= 7 * 4))
(incanter.core/mult 7 4)
user=> (macroexpand-1 '($= 7 * 4 + 3))
(incanter.core/plus (incanter.core/mult 7 4) 3)
user=> (macroexpand-1 '($= 3 + 7 * 4))
(incanter.core/plus 3 (incanter.core/mult 7 4))

Here, we can see that the expression doesn’t expand into Clojure’s * or + functions, but it uses Incanter’s matrix functions, mult and plus, instead. This allows it to handle a variety of input types, including matrices, intelligently.

Otherwise, it switches around the expressions the way we’d expect. Also, we can see by comparing the last two lines of code that it even handles operator precedence correctly.
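To check the precedence handling yourself, a short sketch like the following can help. The numbers and the small matrix are arbitrary values chosen for illustration.

```clojure
(use 'incanter.core)

;; Multiplication binds tighter than addition, so this is 2 + (3 * 4).
(def precedence-result ($= 2 + 3 * 4))

;; Inspect the prefix form that the macro generates for the same expression.
(def expansion (macroexpand-1 '($= 2 + 3 * 4)))

;; The same operators apply element-wise when one operand is a matrix.
(def doubled ($= (matrix [[1 2] [3 4]]) * 2))
```

Comparing expansion with the expansions shown above is a quick way to confirm which Incanter function each operator maps to.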

Selecting columns with $

Often, you need to cut the data to make it more useful. One common transformation is to pull out all the values from one or more columns into a new dataset. This can be useful for generating summary statistics or aggregating the values of some columns.

The Incanter function $ slices out parts of a dataset. In this recipe, we’ll see this in action.

Getting ready

For this recipe, we’ll need to have Incanter listed in our project.clj file:

(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.csv "0.1.2"]])

We’ll also need to include these libraries in our script or REPL:

(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[clojure.string :as str]
         '[incanter.core :as i])

Moreover, we’ll need some data. This time, we’ll use some country data from the World Bank. Point your browser to http://data.worldbank.org/country and select a country. I picked China. Under World Development Indicators, there is a button labeled Download Data. Click on this button and select CSV. This will download a ZIP file. I extracted its contents into the data/chn directory in my project. I bound the filename for the primary data file to the data-file name.

How to do it…

We’ll use the $ function in several different ways to get different results. First, however, we’ll need to load the data into a dataset, which we’ll do in steps 1 and 2:

  1. Before we start, we’ll need a couple of utilities that load the data file into a sequence of maps and make a dataset out of those:
    ;; Turn the first row into keyword headers (spaces become hyphens)
    ;; and zip each remaining row into a map keyed by those headers.
    (defn with-header [coll]
      (let [headers (map #(keyword (str/replace % \space \-))
                         (first coll))]
        (map (partial zipmap headers) (next coll))))
     
    ;; Read the CSV file, skip its first two rows, and build a dataset.
    (defn read-country-data [filename]
      (with-open [r (io/reader filename)]
        (i/to-dataset
          (doall (with-header
                   (drop 2 (csv/read-csv r)))))))
  2. Now, using these functions, we can load the data:
    user=> (def chn-data (read-country-data data-file))
  3. We can select columns to be pulled out from the dataset by passing the column names or numbers to the $ function. It returns a sequence of the values in the column:
    user=> (i/$ :Indicator-Code chn-data)
    ("AG.AGR.TRAC.NO" "AG.CON.FERT.PT.ZS" "AG.CON.FERT.ZS" …
  4. We can select more than one column by listing all of them in a vector. This time, the results are in a dataset:
    user=> (i/$ [:Indicator-Code :1992] chn-data)
     
    |           :Indicator-Code |               :1992 |
    |---------------------------+---------------------|
    |           AG.AGR.TRAC.NO |             770629 |
    |         AG.CON.FERT.PT.ZS |                     |
    |           AG.CON.FERT.ZS |                     |
    |           AG.LND.AGRI.K2 |             5159980 |
    …
  5. We can list as many columns as we want, although the formatting might suffer:
    user=> (i/$ [:Indicator-Code :1992 :2002] chn-data)
     
    |           :Indicator-Code |               :1992 |               :2002 |
    |---------------------------+---------------------+---------------------|
    |           AG.AGR.TRAC.NO |            770629 |                     |
    |         AG.CON.FERT.PT.ZS |                     |     122.73027213719 |
    |           AG.CON.FERT.ZS |                     |   373.087159048868 |
    |           AG.LND.AGRI.K2 |             5159980 |             5231970 |
    …

How it works…

The $ function is just a wrapper over Incanter’s sel function. It provides a good way to slice columns out of the dataset, so we can focus only on the data that actually pertains to our analysis.
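The relationship between $ and sel can be sketched on a small dataset. The scores data below is invented for illustration; the sel calls use its :cols option, which is what the column-only form of $ is understood to delegate to.

```clojure
(require '[incanter.core :as i])

;; Hypothetical data, not taken from the World Bank file.
(def scores
  (i/dataset [:name :math :art]
             [["Ada" 90 70]
              ["Grace" 85 95]]))

;; One column: both calls return the column's values as a sequence.
(def math-via-$   (i/$ :math scores))
(def math-via-sel (i/sel scores :cols :math))

;; Several columns: both calls return a smaller dataset.
(def subset-via-$   (i/$ [:name :art] scores))
(def subset-via-sel (i/sel scores :cols [:name :art]))
```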

There’s more…

The indicator codes for this dataset are a little cryptic. However, the code descriptions are in the dataset too:

user=> (i/$ [0 1 2] [:Indicator-Code :Indicator-Name] chn-data)
 
|   :Indicator-Code |                                               :Indicator-Name |
|-------------------+---------------------------------------------------------------|
|   AG.AGR.TRAC.NO |                             Agricultural machinery, tractors |
| AG.CON.FERT.PT.ZS |           Fertilizer consumption (% of fertilizer production) |
|   AG.CON.FERT.ZS | Fertilizer consumption (kilograms per hectare of arable land) |
…

See also…

  • For information on how to pull out specific rows, see the next recipe, Selecting rows with $.

Selecting rows with $

The Incanter function $ also pulls rows out of a dataset. In this recipe, we’ll see this in action.

Getting ready

For this recipe, we’ll use the same dependencies, imports, and data as we did in the Selecting columns with $ recipe.

How to do it…

Similar to how we use $ in order to select columns, there are several ways in which we can use it to select rows, shown as follows:

  1. We can create a sequence of the values of one row using $, and pass it the index of the row we want as well as passing :all for the columns:
    user=> (i/$ 0 :all chn-data)
    ("AG.AGR.TRAC.NO" "684290" "738526" "52661" "" "880859" "" "" "" "59657" "847916" 
    "862078" "891170" "235524" "126440" "469106" "282282" "817857" "125442" "703117" "CHN"
    "66290" "705723" "824113" "" "151281" "669675" "861364" "559638" "191220" "180772" "73021"
    "858031" "734325" "Agricultural machinery, tractors" "100432" "" "796867" "" "China" ""
    "" "155602" "" "" "770629" "747900" "346786" "" "398946" "876470" "" "795713" "" "55360" "685202" "989139" "798506" "")
  2. We can also pull out a dataset containing multiple rows by passing $ a sequence of row indexes (there’s a lot of data, even for three rows, so I won’t show it here):
    (i/$ (range 3) :all chn-data)
  3. We can also combine the two ways to slice data in order to pull specific columns and rows. We can either pull out a single row or multiple rows:
    user=> (i/$ 0 [:Indicator-Code :1992] chn-data)
    ("AG.AGR.TRAC.NO" "770629")
    user=> (i/$ (range 3) [:Indicator-Code :1992] chn-data)
     
    |   :Indicator-Code | :1992 |
    |-------------------+--------|
    |   AG.AGR.TRAC.NO | 770629 |
    | AG.CON.FERT.PT.ZS |       |
    |   AG.CON.FERT.ZS |       |

How it works…

The $ function is the workhorse used to slice rows and project (or select) columns from datasets. When it’s called with two indexing parameters, the first is the row or rows and the second is the column or columns.
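The row-then-column argument order is easier to see on a dataset small enough to read at a glance. This sketch uses invented data; only the $ calls mirror the recipe.

```clojure
(require '[incanter.core :as i])

;; Hypothetical data for illustration only.
(def scores
  (i/dataset [:name :math :art]
             [["Ada" 90 70]
              ["Grace" 85 95]
              ["Alan" 60 80]]))

(def first-row   (i/$ 0 :all scores))               ; one row, every column
(def one-cell    (i/$ 1 :math scores))              ; one row, one column
(def two-by-two  (i/$ [0 2] [:name :math] scores))  ; rows 0 and 2, two columns
```

A single row comes back as a sequence of values, a single cell as a bare value, and anything with several rows and several columns as a new dataset.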

Filtering datasets with $where

While we can filter datasets before we import them into Incanter, Incanter makes it easy to filter and create new datasets from the existing ones. We’ll take a look at its query language in this recipe.

Getting ready

We’ll use the same dependencies, imports, and data as we did in the Selecting columns with $ recipe.

How to do it…

Once we have the data, we query it using the $where function:

  1. For example, this creates a dataset with a row for the percentage of China’s total land area that is used for agriculture:
    user=> (def land-use
             (i/$where {:Indicator-Code "AG.LND.AGRI.ZS"}
                       chn-data))
    user=> (i/nrow land-use)
    1
    user=> (i/$ [:Indicator-Code :2000] land-use)
    ("AG.LND.AGRI.ZS" "56.2891584865366")
  2. The queries can be more complicated too. This expression picks out the data that exists for 1962 by filtering out any empty strings in that column:
    user=> (i/$ (range 5) [:Indicator-Code :1962]
             (i/$where {:1962 {:ne ""}} chn-data))
     
    |   :Indicator-Code |             :1962 |
    |-------------------+-------------------|
    |   AG.AGR.TRAC.NO |             55360 |
    |   AG.LND.AGRI.K2 |           3460010 |
    |   AG.LND.AGRI.ZS | 37.0949187612906 |
    |   AG.LND.ARBL.HA |         103100000 |
    | AG.LND.ARBL.HA.PC | 0.154858284392508 |

Incanter’s query language is even more powerful than this, but these examples should show you the basic structure and give you an idea of the possibilities.

How it works…

To better understand how to use $where, let’s break apart the last example:

(i/$where {:1962 {:ne ""}} chn-data)

The query is expressed as a hashmap from fields to values; here, that is the outer map, {:1962 {:ne ""}}. As we saw in the first example, the value can be a raw value, either a literal or an expression. This tests for equality.

The value can also be another hashmap of test pairs; here, that is the inner map, {:ne ""}. Each test pair names an operator and its operand, so this one tests for inequality with the empty string.

In this example, both the hashmaps shown only contain one key-value pair. However, they might contain multiple pairs, which will all be ANDed together.

Incanter supports a number of test operators. The basic boolean tests are :$gt (greater than), :$lt (less than), :$gte (greater than or equal to), :$lte (less than or equal to), :$eq (equal to), and :$ne (not equal). There are also some operators that take sets as parameters: :$in and :$nin (not in).

The last operator—:$fn—is interesting. It allows you to use any predicate function. For example, this will randomly select approximately half of the dataset:

(def random-half
  (i/$where {:Indicator-Code {:$fn (fn [_] (< (rand) 0.5))}}
            chn-data))
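The operators listed above can be tried on a small numeric dataset. The towns data below is made up for illustration, and the sketch assumes that the tests in a query are combined with AND, as described earlier.

```clojure
(require '[incanter.core :as i])

;; Hypothetical data: town name, state code, population.
(def towns
  (i/dataset [:town :state :pop]
             [["Northfield" "VA" 5000]
              ["Southfield" "VA" 300]
              ["Eastfield"  "NY" 100]]))

;; A raw value tests for equality.
(def in-va (i/$where {:state "VA"} towns))

;; Two fields in one query: both tests must pass.
(def large-in-va (i/$where {:state "VA" :pop {:$gt 1000}} towns))

;; Set membership with :$in.
(def ny-or-ca (i/$where {:state {:$in #{"NY" "CA"}}} towns))

;; Any predicate function with :$fn.
(def small-towns (i/$where {:pop {:$fn #(< % 1000)}} towns))
```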

There’s more…

For full details of the query language, see the documentation for incanter.core/query-dataset (http://liebke.github.com/incanter/core-api.html#incanter.core/query-dataset).

Grouping data with $group-by

Datasets often come with an inherent structure. Two or more rows might have the same value in one column, and we might want to leverage that by grouping those rows together in our analysis.

Getting ready

First, we’ll need to declare a dependency on Incanter in the project.clj file:

(defproject inc-dsets "0.1.0"
:dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.csv "0.1.2"]])

Next, we’ll include Incanter core and io in our script or REPL:

(require '[incanter.core :as i]
         '[incanter.io :as i-io])

For data, we’ll use the census race data for all the states. You can download it from http://www.ericrochester.com/clj-data-analysis/data/all_160.P3.csv.

These lines will load the data into the race-data name:

(def data-file "data/all_160.P3.csv")
(def race-data (i-io/read-dataset data-file :header true))

How to do it…

Incanter lets you group rows for further analysis or to summarize them with the $group-by function. All you need to do is pass the data to $group-by with the column or function to group on:

(def by-state (i/$group-by :STATE race-data))

How it works…

This function returns a map where each key is a map of the fields and values represented by that grouping. For example, this is how the keys look:

user=> (take 5 (keys by-state))
({:STATE 29} {:STATE 28} {:STATE 31} {:STATE 30} {:STATE 25})

We can get the data for Virginia back out by querying the group map for state 51.

user=> (i/$ (range 3) [:GEOID :STATE :NAME :POP100]
           (by-state {:STATE 51}))
 
| :GEOID | :STATE |         :NAME | :POP100 |
|---------+--------+---------------+---------|
| 5100148 |     51 | Abingdon town |   8191 |
| 5100180 |     51 | Accomac town |     519 |
| 5100724 |     51 | Alberta town |     298 |
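Once the rows are grouped, each value in the map is an ordinary dataset, so it can be summarized with regular Clojure functions. The sketch below uses invented data and reads the :rows field of each group, which holds the rows as maps; totals is a hypothetical name.

```clojure
(require '[incanter.core :as i])

;; Hypothetical data: town name, state code, population.
(def towns
  (i/dataset [:town :state :pop]
             [["Northfield" "VA" 5000]
              ["Southfield" "VA" 300]
              ["Eastfield"  "NY" 100]]))

(def by-state (i/$group-by :state towns))

;; Total population per state, keyed by the state code.
(def totals
  (into {}
        (for [[group-key group] by-state]
          [(:state group-key)
           (reduce + (map :pop (:rows group)))])))
```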

Saving datasets to CSV and JSON

Once you’ve done the work of slicing, dicing, cleaning, and aggregating your datasets, you might want to save them. Incanter by itself doesn’t have a good way to do this. However, with the help of some Clojure libraries, it’s not difficult at all.

Getting ready

We’ll need to include a number of dependencies in our project.clj file:

(defproject inc-dsets "0.1.0"
:dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.csv "0.1.2"]
                 [org.clojure/data.json "0.2.5"]])

We’ll also need to include these libraries in our script or REPL:

(require '[incanter.core :as i]
         '[incanter.io :as i-io]
         '[clojure.data.csv :as csv]
         '[clojure.data.json :as json]
         '[clojure.java.io :as io])

Also, we’ll use the same data that we introduced in the Selecting columns with $ recipe.

How to do it…

This process is really as simple as getting the data and saving it. We’ll pull out the data for the year 2000 from the larger dataset. We’ll use this subset of the data in both the formats here:

(def data2000
  (i/$ [:Indicator-Code :Indicator-Name :2000] chn-data))

Saving data as CSV

To save a dataset as a CSV, all in one statement, open a file and use clojure.data.csv/write-csv to write the column names and data to it:

(with-open [f-out (io/writer "data/chn-2000.csv")]
  (csv/write-csv f-out [(map name (i/col-names data2000))])
  (csv/write-csv f-out (i/to-list data2000)))

Saving data as JSON

To save a dataset as JSON, open a file and use clojure.data.json/write to serialize the data:

(with-open [f-out (io/writer "data/chn-2000.json")]
  (json/write (:rows data2000) f-out))

How it works…

For CSV and JSON, as well as many other data formats, the process is very similar. Get the data, open the file, and serialize data into it. There will be differences in how the output function wants the data (to-list or :rows), and there will be differences in how the output function is called (for instance, whether the file handle is the first or second argument). But generally, outputting datasets will be very similar and relatively simple.
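To see exactly what the CSV step emits without touching the filesystem, the same calls can write into a java.io.StringWriter. This is a sketch under stated assumptions: dataset->csv-string is a hypothetical helper, not part of Incanter or data.csv, and the sample data is invented.

```clojure
(require '[incanter.core :as i]
         '[clojure.data.csv :as csv])

;; Hypothetical data for illustration only.
(def sample (i/dataset [:code :value] [["A" 1] ["B" 2]]))

(defn dataset->csv-string
  "Serializes a dataset to a CSV string: a header row, then one line per row."
  [ds]
  (let [out (java.io.StringWriter.)]
    (csv/write-csv out (cons (map name (i/col-names ds))
                             (i/to-list ds)))
    (str out)))

(def csv-text (dataset->csv-string sample))
```

Swapping the StringWriter for the io/writer used above gives the file-based version from the recipe.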

Projecting from multiple datasets with $join

So far, we’ve been focusing on splitting up datasets, on dividing them into groups of rows or groups of columns with functions such as $ or $where. However, sometimes we’d like to move in the other direction. We might have two related datasets and want to join them together to make a larger one. For example, we might want to join crime data to census data, or take any two related datasets that come from separate sources and analyze them together.

Getting ready

First, we’ll need to include these dependencies in our project.clj file:

(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.csv "0.1.2"]])

We’ll use these statements for inclusions:

(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[clojure.string :as str]
         '[incanter.core :as i])

For our data file, we’ll use the same data that we introduced in the Selecting columns with $ recipe: China’s development dataset from the World Bank.

How to do it…

In this recipe, we’ll take a look at how to join two datasets using Incanter:

  1. To begin with, we’ll load the data from the data/chn/chn_Country_en_csv_v2.csv file. We’ll use the with-header and read-country-data functions that were defined in the Selecting columns with $ recipe:
    (def data-file "data/chn/chn_Country_en_csv_v2.csv")
    (def chn-data (read-country-data data-file))
  2. Currently, the data for each row contains the data for one indicator across many years. However, for some analyses, it will be more helpful to have each row contain the data for one indicator for one year. To do this, let’s first pull out the data from two years into separate datasets. Note that for the second dataset, we’ll only include a column to match the first dataset (:Indicator-Code) and the data column (:2000):
    (def chn-1990
    (i/$ [:Indicator-Code :Indicator-Name :1990]
           chn-data))
    (def chn-2000
    (i/$ [:Indicator-Code :2000] chn-data))
  3. Now, we’ll join these datasets back together. This is contrived, but it’s easy to see how we will do this in a more meaningful example. For example, we might want to join the datasets from two different countries:
    (def chn-decade
    (i/$join [:Indicator-Code :Indicator-Code]
               chn-1990 chn-2000))

From this point on, we can use chn-decade just as we use any other Incanter dataset.

How it works…

Let’s take a look at this in more detail:

(i/$join [:Indicator-Code :Indicator-Code] chn-1990 chn-2000)

The pair of column keywords in a vector ([:Indicator-Code :Indicator-Code]) are the keys that the datasets will be joined on. In this case, the :Indicator-Code column from both the datasets is used, but the keys can be different for the two datasets. The first column that is listed will be from the first dataset (chn-1990), and the second column that is listed will be from the second dataset (chn-2000).

This returns a new dataset. Each row of this new dataset is a superset of the corresponding rows from the two input datasets.
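The join can be sketched on two tiny datasets that share a key column. The names and values below are invented for illustration; the sketch uses the same key name on both sides, as the recipe does.

```clojure
(require '[incanter.core :as i])

;; Hypothetical lookup table and measurements sharing a :code column.
(def labels
  (i/dataset [:code :label]
             [["A" "Alpha"]
              ["B" "Beta"]]))

(def amounts
  (i/dataset [:code :amount]
             [["A" 1]
              ["B" 2]]))

;; Join on :code from the first dataset and :code from the second.
(def joined (i/$join [:code :code] labels amounts))

(i/nrow joined)             ; one row per matching code
(set (i/col-names joined))  ; columns drawn from both inputs
```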

Summary

In this article, we have covered the basics of working with Incanter datasets. Datasets are the core data structures used by Incanter, and understanding them is necessary in order to use Incanter effectively.
