Creating a basic Julia project for loading and saving data [Tutorial]

In this article, we take a look at the common Iris dataset using simple statistical methods. Then we create a simple Julia project to load and save data from the Iris dataset.

This article is an excerpt from a book written by Adrian Salceanu titled Julia Programming Projects. In this book, you will develop and run a web app using Julia and the HTTP package among other things.

To start, we'll load the Iris flowers dataset from the RDatasets package and manipulate it using standard data analysis functions. Then we'll look more closely at the data by employing common visualization techniques. And finally, we'll see how to persist and (re)load our data.

But, in order to do that, first, we need to take a look at some of the language's most important building blocks.

Here are the external packages used in this tutorial and their specific versions:

CSV@v0.4.3
DataFrames@v0.15.2
Feather@v0.5.1
Gadfly@v1.0.1
IJulia@v1.14.1
JSON@v0.20.0
RDatasets@v0.6.1

In order to install a specific version of a package you need to run:

pkg> add PackageName@vX.Y.Z

For example:

pkg> add IJulia@v1.14.1

Alternatively, you can install all the packages used here by downloading the Project.toml file and then running pkg> instantiate, as follows:

julia> download("https://raw.githubusercontent.com/PacktPublishing/Julia-Programming-Projects/master/Chapter02/Project.toml", "Project.toml")
pkg> activate . 
pkg> instantiate

Using simple statistics to better understand our data

Now that it's clear how the data is structured and what is contained in the collection, we can get a better understanding by looking at some basic stats.
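The snippets in this section assume that an iris DataFrame already exists in the session. As a minimal sketch, and assuming the RDatasets package from the list above is installed, it could be loaded like this:

```julia
using RDatasets   # assumed to be installed, see the package list above

# Load the Iris flowers dataset; the result is a DataFrame
iris = dataset("datasets", "iris")

size(iris)    # (150, 5): 150 rows and 5 columns
names(iris)   # the column names, ending with Species
```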

To get us started, let's invoke the describe function:

julia> describe(iris)

The output is as follows:

[Image: output of describe(iris)]

This function summarizes the columns of the iris DataFrame. If the columns contain numerical data (such as SepalLength), it will compute the minimum, median, mean, and maximum. The number of missing and unique values is also included. The last column reports the type of data stored in each column.

A few other stats are available, including the 25th and the 75th percentiles, and the first and the last values. We can ask for them by passing an extra stats argument, in the form of an array of symbols:

julia> describe(iris, stats=[:q25, :q75, :first, :last])

The output is as follows:

[Image: output of describe(iris, stats=[:q25, :q75, :first, :last])]

Any combination of stats labels is accepted. These are all the options: :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing.

In order to get all the stats, the special :all value is accepted:

julia> describe(iris, stats=:all)

The output is as follows:

[Image: output of describe(iris, stats=:all)]
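To make the stats labels more concrete, here is a small sketch that computes most of them by hand for a plain vector, using only the Statistics standard library. The vector holds the first five SepalLength values printed later in this article. The names sample and summary_stats are made up for this illustration and are not part of DataFrames.

```julia
using Statistics

# First five SepalLength values; any numeric vector would work
sample = [5.1, 4.9, 4.7, 4.6, 5.0]

# A named tuple whose fields mirror the stats labels accepted by describe
summary_stats = (
    mean   = mean(sample),
    std    = std(sample),
    min    = minimum(sample),
    q25    = quantile(sample, 0.25),   # 25th percentile
    median = median(sample),
    q75    = quantile(sample, 0.75),   # 75th percentile
    max    = maximum(sample),
    first  = first(sample),
    last   = last(sample),
)

println(summary_stats)
```

This only illustrates what each label stands for; describe remains the convenient way to get these values for every column at once.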

We can also compute these individually by using Julia's Statistics package. For example, to calculate the mean of the SepalLength column, we'll execute the following:

julia> using Statistics 
julia> mean(iris[:SepalLength]) 
5.843333333333334

In this example, we use iris[:SepalLength] to select the whole column. The result, not at all surprisingly, is the same as that returned by the corresponding describe() value.

In a similar way we can compute the median():

julia> median(iris[:SepalLength]) 
5.8

And there's (a lot) more, for instance, the standard deviation std():

julia> std(iris[:SepalLength]) 
0.828066127977863

Or, we can use another function from the Statistics package, cor(), in a simple script to help us understand how the values are correlated:

julia> for x in names(iris)[1:end-1]    
        for y in names(iris)[1:end-1] 
          println("$x \t $y \t $(cor(iris[x], iris[y]))") 
        end 
        println("-------------------------------------------") 
      end

Executing this snippet will produce the following output:

SepalLength       SepalLength    1.0 
SepalLength       SepalWidth     -0.11756978413300191 
SepalLength       PetalLength    0.8717537758865831 
SepalLength       PetalWidth     0.8179411262715759 
------------------------------------------------------------ 
SepalWidth         SepalLength    -0.11756978413300191 
SepalWidth         SepalWidth     1.0 
SepalWidth         PetalLength    -0.42844010433053953 
SepalWidth         PetalWidth     -0.3661259325364388 
------------------------------------------------------------ 
PetalLength       SepalLength    0.8717537758865831 
PetalLength       SepalWidth     -0.42844010433053953 
PetalLength       PetalLength    1.0 
PetalLength       PetalWidth     0.9628654314027963 
------------------------------------------------------------ 
PetalWidth         SepalLength    0.8179411262715759 
PetalWidth         SepalWidth     -0.3661259325364388 
PetalWidth         PetalLength    0.9628654314027963 
PetalWidth         PetalWidth     1.0 
------------------------------------------------------------

The script iterates over each column of the dataset with the exception of Species (the last column, which is not numeric), and generates a basic correlation table. The table shows strong positive correlations between SepalLength and PetalLength (about 0.87), SepalLength and PetalWidth (about 0.82), and PetalLength and PetalWidth (about 0.96). SepalLength and SepalWidth, on the other hand, are only weakly (and negatively) correlated, at about -0.12.
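As an aside, the same pairwise table can be obtained in a single call when the numeric columns are available as a matrix, because cor() also accepts a matrix and correlates its columns. The following is a sketch on a small hand-typed matrix (the first five rows of SepalLength, SepalWidth, and PetalLength as printed later in this article), not on the full dataset:

```julia
using Statistics

# Rows are observations, columns are SepalLength, SepalWidth, PetalLength
X = [5.1 3.5 1.4;
     4.9 3.0 1.4;
     4.7 3.2 1.3;
     4.6 3.1 1.5;
     5.0 3.6 1.4]

# C[i, j] is the correlation between column i and column j
C = cor(X)

# The matrix entries agree with the pairwise calls used in the loop
println(C[1, 2] ≈ cor(X[:, 1], X[:, 2]))
```

With only five rows the values differ from those in the table above; the point is the shape of the result, a symmetric matrix with ones on the diagonal.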

We can use the same script, but this time employ the cov() function to compute the covariance of the values in the dataset:

julia> for x in names(iris)[1:end-1] 
         for y in names(iris)[1:end-1] 
           println("$x \t $y \t $(cov(iris[x], iris[y]))") 
         end 
         println("--------------------------------------------") 
       end

This code will generate the following output:

SepalLength       SepalLength    0.6856935123042507 
SepalLength       SepalWidth     -0.04243400447427293 
SepalLength       PetalLength    1.2743154362416105 
SepalLength       PetalWidth     0.5162706935123043 
------------------------------------------------------- 
SepalWidth         SepalLength    -0.04243400447427293 
SepalWidth         SepalWidth     0.189979418344519 
SepalWidth         PetalLength    -0.3296563758389262 
SepalWidth         PetalWidth     -0.12163937360178968 
------------------------------------------------------- 
PetalLength       SepalLength    1.2743154362416105 
PetalLength       SepalWidth     -0.3296563758389262 
PetalLength       PetalLength    3.1162778523489933 
PetalLength       PetalWidth     1.2956093959731543 
------------------------------------------------------- 
PetalWidth         SepalLength    0.5162706935123043 
PetalWidth         SepalWidth     -0.12163937360178968 
PetalWidth         PetalLength    1.2956093959731543 
PetalWidth         PetalWidth     0.5810062639821031 
-------------------------------------------------------

The output illustrates that SepalLength is positively related to PetalLength and PetalWidth, while being negatively related to SepalWidth. SepalWidth is negatively related to all the other values.
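The covariance and correlation tables are closely related: the correlation is the covariance divided by the product of the two standard deviations, which is why both tables have the same signs. The sketch below checks this on two short hand-typed vectors (the first five SepalLength and SepalWidth values), and also shows that rescaling one variable changes the covariance but not the correlation:

```julia
using Statistics

x = [5.1, 4.9, 4.7, 4.6, 5.0]   # SepalLength, first five rows
y = [3.5, 3.0, 3.2, 3.1, 3.6]   # SepalWidth, first five rows

# Correlation is covariance divided by the product of the standard deviations
r = cov(x, y) / (std(x) * std(y))
println(r ≈ cor(x, y))

# Expressing x in millimeters instead of centimeters multiplies the covariance by 10 ...
println(cov(10 .* x, y) ≈ 10 * cov(x, y))

# ... but leaves the correlation unchanged
println(cor(10 .* x, y) ≈ cor(x, y))
```

This is the reason the covariance values are harder to compare with each other than the correlation coefficients: they depend on the scale of each column.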

Moving on, if we want a randomly picked value from one of the columns, we can ask for it like this:

julia> rand(iris[:SepalLength]) 
7.4

Optionally, we can pass in the number of values to be sampled:

julia> rand(iris[:SepalLength], 5) 
5-element Array{Float64,1}: 
 6.9 
 5.8 
 6.7 
 5.0 
 5.6
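Note that rand draws each value independently, so the same element can be picked more than once. If distinct elements are needed, one option is to shuffle the values and take the first few. Here is a minimal sketch using the Random standard library on a short hand-typed vector; the seed is arbitrary and only makes reruns repeatable:

```julia
using Random

Random.seed!(1234)   # arbitrary seed, for repeatable runs only

v = [5.1, 4.9, 4.7, 4.6, 5.0]

# Independent draws: the same element may appear more than once
with_replacement = rand(v, 3)

# Shuffle a copy of v and keep the first three: three distinct elements
without_replacement = shuffle(v)[1:3]

println(with_replacement)
println(without_replacement)
```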

We can convert one of the columns to an array using the following:

julia> sepallength = Array(iris[:SepalLength]) 
150-element Array{Float64,1}: 
 5.1 
 4.9 
 4.7 
 4.6 
 5.0 
 # ... output truncated ...

Or we can convert the whole DataFrame to a matrix:

julia> irisarr = convert(Array, iris[:,:]) 
150×5 Array{Any,2}: 
 5.1  3.5  1.4  0.2  CategoricalString{UInt8} "setosa"    
 4.9  3.0  1.4  0.2  CategoricalString{UInt8} "setosa"    
 4.7  3.2  1.3  0.2  CategoricalString{UInt8} "setosa"    
 4.6  3.1  1.5  0.2  CategoricalString{UInt8} "setosa"    
 5.0  3.6  1.4  0.2  CategoricalString{UInt8} "setosa"   
 # ... output truncated ...

Loading and saving our data

Julia comes with excellent facilities for reading and storing data out of the box. Given its focus on data science and scientific computing, support for tabular-file formats (CSV, TSV) is first class.

Let's extract some data from our initial dataset and use it to practice persistence and retrieval from various backends.

We can reference a section of a DataFrame by defining its bounds through the corresponding columns and rows. For example, we can define a new DataFrame composed only of the PetalLength and PetalWidth columns and the first three rows:

julia> iris[1:3, [:PetalLength, :PetalWidth]] 
3×2 DataFrames.DataFrame 
│ Row │ PetalLength │ PetalWidth │ 
├─────┼─────────────┼────────────┤ 
│ 1   │ 1.4         │ 0.2        │ 
│ 2   │ 1.4         │ 0.2        │ 
│ 3   │ 1.3         │ 0.2        │

The generic indexing notation is dataframe[rows, cols], where rows can be a number, a range, or an Array of boolean values where true indicates that the row should be included:

julia> iris[trues(150), [:PetalLength, :PetalWidth]]

This snippet will select all 150 rows, since trues(150) constructs an array of 150 elements that are all initialized as true. The same logic applies to cols, with the added benefit that they can also be accessed by name.

Armed with this knowledge, let's take a sample from our original dataset. It will include some 10% of the initial data and only the PetalLength, PetalWidth, and Species columns:

julia> test_data = iris[rand(150) .<= 0.1, [:PetalLength, :PetalWidth, :Species]] 
10×3 DataFrames.DataFrame 
│ Row │ PetalLength │ PetalWidth │ Species      │ 
├─────┼─────────────┼────────────┼──────────────┤ 
│ 1   │ 1.1         │ 0.1        │ "setosa"     │ 
│ 2   │ 1.9         │ 0.4        │ "setosa"     │ 
│ 3   │ 4.6         │ 1.3        │ "versicolor" │ 
│ 4   │ 5.0         │ 1.7        │ "versicolor" │ 
│ 5   │ 3.7         │ 1.0        │ "versicolor" │ 
│ 6   │ 4.7         │ 1.5        │ "versicolor" │ 
│ 7   │ 4.6         │ 1.4        │ "versicolor" │ 
│ 8   │ 6.1         │ 2.5        │ "virginica"  │ 
│ 9   │ 6.9         │ 2.3        │ "virginica"  │ 
│ 10  │ 6.7         │ 2.0        │ "virginica"  │

What just happened here? The secret in this piece of code is rand(150) .<= 0.1. It does a lot: first, it generates an array of 150 random Float64 values between 0 and 1; then, it compares the array, element-wise, against 0.1, so that each element has a 10% chance of being true; and finally, the resultant Boolean array is used to keep only the rows whose entry is true. It's really impressive how powerful and succinct Julia can be!

In my case, the result is a DataFrame with the preceding 10 rows, but your data will be different since we're picking random rows (and it's quite possible you won't have exactly 10 rows either).
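To see why the number of rows varies, it helps to look at the Boolean mask on its own. The following sketch uses only the Random standard library and no DataFrame; the seed is arbitrary, and the names mask, selected, and exact are made up for this illustration:

```julia
using Random

Random.seed!(42)   # arbitrary seed, for repeatable runs only

n = 150

# Each entry is true with probability 0.1, independently of the others
mask = rand(n) .<= 0.1

# Row numbers that the mask would keep
selected = findall(mask)
println(count(mask))   # changes with the seed; 15 on average for n = 150

# If exactly 10% of the rows are wanted, pick 15 distinct row numbers instead
exact = sort(randperm(n)[1:15])
println(length(exact))
```

Either mask or exact can then be used in the rows position of the dataframe[rows, cols] notation described above.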

Saving and loading using tabular file formats

We can easily save this data to a file in a tabular file format (one of CSV, TSV, and others) using the CSV package. We'll have to add it first and then call the write method:

pkg> add CSV 
julia> using CSV 
julia> CSV.write("test_data.csv", test_data)

And, just as easily, we can read back the data from tabular file formats, with the corresponding CSV.read function:

julia> td = CSV.read("test_data.csv") 
10×3 DataFrames.DataFrame 
│ Row │ PetalLength │ PetalWidth │ Species      │ 
├─────┼─────────────┼────────────┼──────────────┤ 
│ 1   │ 1.1         │ 0.1        │ "setosa"     │ 
│ 2   │ 1.9         │ 0.4        │ "setosa"     │ 
│ 3   │ 4.6         │ 1.3        │ "versicolor" │ 
│ 4   │ 5.0         │ 1.7        │ "versicolor" │ 
│ 5   │ 3.7         │ 1.0        │ "versicolor" │ 
│ 6   │ 4.7         │ 1.5        │ "versicolor" │ 
│ 7   │ 4.6         │ 1.4        │ "versicolor" │ 
│ 8   │ 6.1         │ 2.5        │ "virginica"  │ 
│ 9   │ 6.9         │ 2.3        │ "virginica"  │ 
│ 10  │ 6.7         │ 2.0        │ "virginica"  │

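Just specifying the file extension is often enough for the CSV package to work out how to handle the document (CSV, TSV). If the delimiter is not picked up as expected, it can be passed explicitly through the delim keyword argument, both when writing and reading.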
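To illustrate what a round trip through a tab-separated file involves, here is a sketch that deliberately uses Julia's DelimitedFiles standard library instead of the CSV package, so that it runs without any extra installation. The rows are the first three from the earlier example, the file is written to a temporary directory, and the delimiter is passed explicitly:

```julia
using DelimitedFiles

# A header row and three data rows (hand-typed for this illustration)
header = ["PetalLength" "PetalWidth" "Species"]
rows = [1.4 0.2 "setosa";
        1.4 0.2 "setosa";
        1.3 0.2 "setosa"]

# Write everything as tab-separated text into a temporary directory
path = joinpath(mktempdir(), "test_data.tsv")
writedlm(path, vcat(header, rows), '\t')

# Read it back; with header=true the first line is returned separately
data, colnames = readdlm(path, '\t'; header=true)

println(vec(colnames))
println(size(data))
```

Unlike CSV.read, readdlm returns a plain matrix rather than a DataFrame, so this is only meant to show the role of the delimiter, not to replace the CSV package.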

Working with Feather files

Feather is a binary file format that was specially designed for storing data frames. It is fast, lightweight, and language-agnostic. The project was initially started in order to make it possible to exchange data frames between R and Python. Soon, other languages added support for it, including Julia.

Support for Feather files does not come out of the box, but is made available through the package of the same name. Let's go ahead and add it and then bring it into scope:

pkg> add Feather  
julia> using Feather

Now, saving our DataFrame is just a matter of calling Feather.write:

julia> Feather.write("test_data.feather", test_data)

Next, let's try the reverse operation and load back our Feather file. We'll use the counterpart read function:

julia> Feather.read("test_data.feather") 
10×3 DataFrames.DataFrame 
│ Row │ PetalLength │ PetalWidth │ Species      │ 
├─────┼─────────────┼────────────┼──────────────┤ 
│ 1   │ 1.1         │ 0.1        │ "setosa"     │ 
│ 2   │ 1.9         │ 0.4        │ "setosa"     │ 
│ 3   │ 4.6         │ 1.3        │ "versicolor" │ 
│ 4   │ 5.0         │ 1.7        │ "versicolor" │ 
│ 5   │ 3.7         │ 1.0        │ "versicolor" │ 
│ 6   │ 4.7         │ 1.5        │ "versicolor" │ 
│ 7   │ 4.6         │ 1.4        │ "versicolor" │ 
│ 8   │ 6.1         │ 2.5        │ "virginica"  │ 
│ 9   │ 6.9         │ 2.3        │ "virginica"  │ 
│ 10  │ 6.7         │ 2.0        │ "virginica"  │

Yeah, that's our sample data all right!

In order to provide compatibility with other languages, the Feather format imposes some restrictions on the data types of the columns. You can read more about Feather in the package's official documentation at https://juliadata.github.io/Feather.jl/latest/index.html.

Saving and loading with MongoDB

Let's also take a look at using a NoSQL backend for persisting and retrieving our data.

In order to follow through this part, you'll need a working MongoDB installation. You can download and install the correct version for your operating system from the official website, at https://www.mongodb.com/download-center?jmp=nav#community. I will use a Docker image which I installed and started up through Docker's Kitematic (available for download at https://github.com/docker/kitematic/releases).

Next, we need to make sure to add the Mongo package. The package also has a dependency on LibBSON, which is automatically added. LibBSON is used for handling BSON, which stands for Binary JSON, a binary-encoded serialization of JSON-like documents. While we're at it, let's add the JSON package as well; we will need it. I'm sure you know how to do that by now—if not, here is a reminder:

pkg> add Mongo, JSON

At the time of writing, Mongo.jl support for Julia v1 was still a work in progress. This code was tested using Julia v0.6.

Easy! Let's let Julia know that we'll be using all these packages:

julia> using Mongo, LibBSON, JSON

We're now ready to connect to MongoDB:

julia> client = MongoClient()

Once successfully connected, we can reference a dataframes collection in the db database:

julia> storage = MongoCollection(client, "db", "dataframes")

Julia's MongoDB interface uses dictionaries (a data structure called Dict in Julia) to communicate with the server. For now, all we need to do is to convert our DataFrame to such a Dict. The simplest way to do it is to sequentially serialize and then deserialize the DataFrame by using the JSON package. It generates a nice structure that we can later use to rebuild our DataFrame:

julia> datadict = JSON.parse(JSON.json(test_data))

Thinking ahead, to make any future data retrieval simpler, let's add an identifier to our dictionary:

julia> datadict["id"] = "iris_test_data"

Now we can insert it into Mongo:

julia> insert(storage, datadict)

In order to retrieve it, all we have to do is query the Mongo database using the "id" field we've previously configured:

julia> data_from_mongo = first(find(storage, query("id" => "iris_test_data")))

We get a BSONObject, which we need to convert back to a DataFrame. Don't worry, it's straightforward. First, we create an empty DataFrame:

julia> df_from_mongo = DataFrame() 
0×0 DataFrames.DataFrame

Then we populate it using the data we retrieved from Mongo:

julia> for i in 1:length(data_from_mongo["columns"]) 
         df_from_mongo[Symbol(data_from_mongo["colindex"]["names"][i])] = Array(data_from_mongo["columns"][i]) 
       end 
julia> df_from_mongo 
10×3 DataFrames.DataFrame 
│ Row │ PetalLength │ PetalWidth │ Species      │ 
├─────┼─────────────┼────────────┼──────────────┤ 
│ 1   │ 1.1         │ 0.1        │ "setosa"     │ 
│ 2   │ 1.9         │ 0.4        │ "setosa"     │ 
│ 3   │ 4.6         │ 1.3        │ "versicolor" │ 
│ 4   │ 5.0         │ 1.7        │ "versicolor" │ 
│ 5   │ 3.7         │ 1.0        │ "versicolor" │ 
│ 6   │ 4.7         │ 1.5        │ "versicolor" │ 
│ 7   │ 4.6         │ 1.4        │ "versicolor" │ 
│ 8   │ 6.1         │ 2.5        │ "virginica"  │ 
│ 9   │ 6.9         │ 2.3        │ "virginica"  │ 
│ 10  │ 6.7         │ 2.0        │ "virginica"  │

And that's it! Our data has been loaded back into a DataFrame.
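The rebuilding loop relies on the shape of the stored document: a "columns" entry holding one array per column, and a "colindex" entry whose "names" list gives the column names in the same order. The sketch below reproduces that shape with a hand-written Dict (a stand-in for the object returned by Mongo, with only three rows) and pairs names with columns without needing a database or any package:

```julia
# Hand-written stand-in for the retrieved document (not actual Mongo output)
doc = Dict(
    "id" => "iris_test_data",
    "colindex" => Dict("names" => ["PetalLength", "PetalWidth", "Species"]),
    "columns" => [
        [1.1, 1.9, 4.6],
        [0.1, 0.4, 1.3],
        ["setosa", "setosa", "versicolor"],
    ],
)

# Column names as symbols, in the same order as the column arrays
colnames = Symbol.(doc["colindex"]["names"])

# Pair each name with its column in a named tuple of vectors
table = NamedTuple{Tuple(colnames)}(Tuple(doc["columns"]))

println(keys(table))
println(table.Species)
```

A named tuple of vectors like this is a simple column-oriented table; the loop above does the same pairing, but assigns each column into the DataFrame instead.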

In this tutorial, we looked at the Iris dataset and worked on loading and saving the data in a simple Julia project. To learn more about machine learning recommendations in Julia and about testing the model, check out the book Julia Programming Projects.
