How to perform Data exploration with RethinkDB

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shahid Shaikh titled Mastering RethinkDB. This book will let you master the capabilities of RethinkDB and implement them to develop efficient real-time web applications.[/box]

In this article, we will learn to do data exploration in RethinkDB with the help of few use case.

Executing data exploration use cases

We have imported our database i.e. our mock data into our RethinkDB instance. Now it's time
to run a use case query and make use of it. But before we do so, we need to figure out one data alteration. We have made a mistake while generating mock data (on purpose actually) we have a $ sign before ctc. Hence, it becomes tough to perform salary-level queries. Before we move ahead, we need to figure out this problem, and basically get rid of the $ sign and update the ctc value to an integer instead of a string.

In order to do this, we need to perform the following operation:

Traverse through each document in the database
Split the ctc string into two parts, containing $ and the other value
Update the ctc value in the document with a new data type and value

Since we require the chaining of queries, I have written a small snippet in Node.js to achieve the previous scenario as follows:

var rethinkdb = require('rethinkdb');
var connection = null;
rethinkdb.connect({host : 'localhost', port : 28015},function(err,conn) {
if(err) {
throw new Error('Connection error');
}
connection = conn;
rethinkdb.db("company").table("employees")
.run(connection,function(err,cursor) {
if(err) {
throw new Error(err);
}
cursor.each(function(err,data) {
data.ctc = parseInt(data.ctc.split("$")[1]);
rethinkdb.db("company").table("employees")
.get(data.id)
.update({ctc : data.ctc})
.run(connection,function(err,response) {
if(err) {
throw new Error(err);
}
console.log(response);
});
});
});
});

As you can see in the preceding code, we first fetch all the documents and traverse them using cursor, one document at a time. We use the split() method as a $ separator and convert the outcome, which is salary, into an integer using the parseInt() method. We update each document at a time using the id value of the document:

how-to-perform-data-exploration-with-rethinkdb-img-0

After selecting all the documents again, we can see an updated ctc value as an integer, as shown in the following figure:

how-to-perform-data-exploration-with-rethinkdb-img-1

This is one of the practical examples where we perform some data manipulation before moving ahead with complex queries. Similarly, you can look for errors such as blank spaces in a specific field or duplicate elements in your record.

Finding duplicate elements

We can use distinct() to find out whether there is any duplicate element present in the table. Say you have 1,000 rows and there are 10 duplicates. In order to determine that, we just need to find out the unique rows (of course excluding the ID key, as that's unique by nature).

Here is the query for the same:

r.db("company").table('employees').without('id').distinct().count()

As shown in the following screenshot, this query returns the count of unique rows, which should be 1,000 if there are no duplicates:

how-to-perform-data-exploration-with-rethinkdb-img-2

This implies that our records contain no duplicate documents.

Finding the list of countries

We can write a query to find all the countries we have in our record and also use distinct again by just selecting the country field. Here is the query:

r.db("company").table('employees')("country").distinct()

As shown in this image, we have 124 countries in our records:

how-to-perform-data-exploration-with-rethinkdb-img-3

Finding the top 10 employees with the highest salary

In this use case, we need to evaluate all the records and find the top 10 employees with the highest to lowest pay. Here is the query for the same:

r.db("company").table("employees").orderBy(r.desc("ctc")).limit(10)

Here we are using orderBy, which by default orders the record in ascending order. To get the highest pay at the first document, we need to use descending ordering; we did it using the desc() ReQL command.

As shown in the following image, the query returns 10 rows:

how-to-perform-data-exploration-with-rethinkdb-img-4

You can modify the same query by just by limiting the number of users to one to get thehighest-paid employee.

Displaying employee records with a specific name and location

To extract such records from our table, we need to again perform a filter on the "first_name" and "country" fields. Here is the query to return those records:

r.db("company").table('employees').filter({"first_name" : "John","country"

: "Sweden"})

We are just performing a basic filter and comparing both fields. ReQL queries are really easy for solving such queries due to their chaining feature. After executing the preceding query, we show the following output:

how-to-perform-data-exploration-with-rethinkdb-img-5

To summarize, we looked over a few use cases where we had to perform alteration and filtering of records in order to meet exploration task, like stripping the $ sign from ctc, or converting base 256 ip addresses into base 10 values and then performing a query on them. We also covered a general use case in order to get a practical feel of ReQL.

If you are interested to learn about RethinkDB Query Language, Extending RethinkDB, and more you may check out this book Mastering RethinkDB.