
Indexing data that is not flat

Not all data is flat. If we are building the system that ElasticSearch will be a part of, we can of course shape the data so that it is convenient for ElasticSearch. However, that structure doesn't need to be flat; it can be more object-oriented. Let's see how to create mappings for fully structured JSON objects.

Data

Let's assume we have the following data (we store it in the file called structured_data.json):

{
  "book" : {
    "author" : {
      "name" : {
        "firstName" : "Fyodor",
        "lastName" : "Dostoevsky"
      }
    },
    "isbn" : "123456789",
    "englishTitle" : "Crime and Punishment",
    "originalTitle" : "Преступлéние и наказáние",
    "year" : 1886,
    "characters" : [
      {
        "name" : "Raskolnikov"
      },
      {
        "name" : "Sofia"
      }
    ],
    "copies" : 0
  }
}

As you can see, the data is not flat. It contains arrays and nested objects, so we can't use the mappings we used previously. We can, however, create mappings that handle such data.

Objects

The previous example shows a structured JSON document. The root object in our file is book. The root object is a special one, because it allows us to define additional properties. The book root object has some simple properties such as englishTitle, originalTitle, and so on. Those will be indexed as normal fields in the index. In addition, it has the characters array, which we will discuss in the next section. For now, let's focus on author. The author object has another object nested in it, the name object, which has two properties: firstName and lastName.

Arrays

We have already used array type data, but we didn't talk about it. By default, all fields in Lucene, and thus in ElasticSearch, are multivalued, which means that they can store multiple values. To send such fields to ElasticSearch for indexing, we use the JSON array type, whose values are enclosed in opening and closing square brackets []. As you can see in the previous example, we used the array type for the characters property.

Mappings

So, how do we index data like that shown previously? To index arrays we don't need to do anything special; we just specify the properties of the array's elements under the array's name. In our case, to index the characters data, we would add mappings such as these:

"characters" : {
  "properties" : {
    "name" : {"type" : "string", "store" : "yes"}
  }
}

Nothing strange here: we just nest the properties section inside the array's name (characters in our case) and define the fields there. As a result of this mapping, we get the multivalued characters.name field in the index.
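To make the idea of a multivalued field more concrete, here is a minimal Python sketch of how an array of objects can collapse into one field holding several values. It is only a model of the behavior described above, not ElasticSearch code, and the helper name collect_values is hypothetical.

```python
def collect_values(node, path):
    """Return every value found under a dotted path, descending into arrays."""
    if not path:
        # End of the path: a list contributes each element, a scalar itself.
        return list(node) if isinstance(node, list) else [node]
    if isinstance(node, list):
        # An array in the middle of the path: look inside every element.
        values = []
        for item in node:
            values.extend(collect_values(item, path))
        return values
    if isinstance(node, dict) and path[0] in node:
        return collect_values(node[path[0]], path[1:])
    return []


book = {
    "characters": [{"name": "Raskolnikov"}, {"name": "Sofia"}],
    "copies": 0,
}

names = collect_values(book, "characters.name".split("."))
copies = collect_values(book, ["copies"])
print(names)
print(copies)
```

In this model a single value and an array are handled the same way, which mirrors the statement that every field can hold multiple values.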

We perform similar steps for our author object. We name the section after the object as it appears in the data, but in addition to the properties section we also tell ElasticSearch to expect an object by adding the type property with the value object. The author object has the name object nested in it, so we do the same again and nest another object inside it. Our mappings for that look like the following code:

"author" : {
  "type" : "object",
  "properties" : {
    "name" : {
      "type" : "object",
      "properties" : {
        "firstName" : {"type" : "string", "store" : "yes"},
        "lastName" : {"type" : "string", "store" : "yes"}
      }
    }
  }
}

The firstName and lastName fields would appear in the index as author.name.firstName and author.name.lastName. We will check if that is true in just a second.
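The way these dotted names are derived from the nested properties sections can be sketched in a few lines of Python. This is an illustration under the assumption that a field name is simply the chain of enclosing object names joined with dots; field_paths is a hypothetical helper, not part of any ElasticSearch client.

```python
def field_paths(properties, prefix=""):
    """Yield (dotted_name, type) for every leaf field in a 'properties' section."""
    for name, definition in properties.items():
        full_name = prefix + name
        if "properties" in definition:
            # Object (or array of objects): descend and extend the prefix.
            yield from field_paths(definition["properties"], full_name + ".")
        else:
            yield full_name, definition.get("type")


author_mapping = {
    "author": {
        "type": "object",
        "properties": {
            "name": {
                "type": "object",
                "properties": {
                    "firstName": {"type": "string", "store": "yes"},
                    "lastName": {"type": "string", "store": "yes"},
                },
            }
        },
    }
}

paths = list(field_paths(author_mapping))
for name, field_type in paths:
    print(name, field_type)
```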

The rest of the fields are simple core types, so I'll skip discussing them.

Final mappings

Our final mappings file, which we've called structured_mapping.json, looks like the following:

{
  "book" : {
    "properties" : {
      "author" : {
        "type" : "object",
        "properties" : {
          "name" : {
            "type" : "object",
            "properties" : {
              "firstName" : {"type" : "string", "store" : "yes"},
              "lastName" : {"type" : "string", "store" : "yes"}
            }
          }
        }
      },
      "isbn" : {"type" : "string", "store" : "yes"},
      "englishTitle" : {"type" : "string", "store" : "yes"},
      "originalTitle" : {"type" : "string", "store" : "yes"},
      "year" : {"type" : "integer", "store" : "yes"},
      "characters" : {
        "properties" : {
          "name" : {"type" : "string", "store" : "yes"}
        }
      },
      "copies" : {"type" : "integer", "store" : "yes"}
    }
  }
}

To be or not to be dynamic

As we already know, ElasticSearch is schemaless, which means that it can index data without us creating the mappings first (although we should create them if we want to control the index structure). This dynamic behavior is turned on by default, but there may be situations where you want to turn it off for some parts of your index. To do that, add the dynamic property set to false at the same level of nesting as the type property of the object that shouldn't be dynamic. For example, if we would like our author and name objects not to be dynamic, we should modify the relevant parts of the mappings file so that they look like the following code:

"author" : {
  "type" : "object",
  "dynamic" : false,
  "properties" : {
    "name" : {
      "type" : "object",
      "dynamic" : false,
      "properties" : {
        "firstName" : {"type" : "string", "store" : "yes"},
        "lastName" : {"type" : "string", "store" : "yes"}
      }
    }
  }
}

However, please remember that in order to add new fields for such objects, we would have to update the mappings.
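The effect of the dynamic property can be modeled roughly in Python. The sketch below assumes that, in a non-dynamic object, fields missing from the mappings are simply left out of the index, and that a nested object uses its parent's setting unless it declares its own. Both are simplifying assumptions of this model. The helper indexable_fields and the extra fields middleName, born, and publisher are made up for illustration.

```python
def indexable_fields(doc, properties, dynamic=True, prefix=""):
    """List dotted field names that would be indexed under this simplified model."""
    result = []
    for key, value in doc.items():
        mapped = key in properties
        if not mapped and not dynamic:
            # Unmapped field inside a non-dynamic object: not added to the index.
            continue
        definition = properties.get(key, {})
        name = prefix + key
        if isinstance(value, dict):
            # Assumption: a nested object inherits the setting unless it overrides it.
            result.extend(
                indexable_fields(
                    value,
                    definition.get("properties", {}),
                    definition.get("dynamic", dynamic),
                    name + ".",
                )
            )
        else:
            result.append(name)
    return result


mapping = {
    "author": {
        "type": "object",
        "dynamic": False,
        "properties": {
            "name": {
                "type": "object",
                "dynamic": False,
                "properties": {
                    "firstName": {"type": "string"},
                    "lastName": {"type": "string"},
                },
            }
        },
    }
}

doc = {
    "author": {
        "name": {
            "firstName": "Fyodor",
            "lastName": "Dostoevsky",
            "middleName": "Mikhailovich",  # illustrative, not in the mappings
        },
        "born": 1821,  # illustrative, not in the mappings
    },
    "publisher": "example",  # illustrative; the root here is still dynamic
}

fields = indexable_fields(doc, mapping)
print(fields)
```

In this model only the root level still accepts new fields, so publisher is picked up while middleName and born are not.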

You can also turn off the dynamic mapping functionality by adding the index.mapper.dynamic : false property to your elasticsearch.yml configuration file.

Sending the mappings to ElasticSearch

The last thing to do is test whether all the work we did actually works. This time we will use a slightly different technique of creating an index and adding the mappings. First, let's create the library index with the following command:

curl -XPUT 'localhost:9200/library'

Now, let's send our mappings for the book type:

curl -XPUT 'localhost:9200/library/book/_mapping' -d @structured_mapping.json

Now we can index our example data:

curl -XPOST 'localhost:9200/library/book/1' -d @structured_data.json
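For readers who prefer scripting these steps, the same three requests can be prepared with Python's standard library. This is a sketch only: it assumes a node on localhost:9200 as in the curl commands, it uses abbreviated stand-ins for the contents of structured_mapping.json and structured_data.json, and it builds the requests without sending them. The helper build_request is hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumed local node, as in the curl commands


def build_request(method, path, body=None):
    """Prepare (but do not send) an HTTP request with an optional JSON body."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    return urllib.request.Request(BASE_URL + path, data=data, method=method)


# Abbreviated stand-ins; in practice these would be loaded from the two files.
mapping = {"book": {"properties": {"isbn": {"type": "string", "store": "yes"}}}}
document = {"book": {"isbn": "123456789"}}

prepared = [
    build_request("PUT", "/library"),                         # create the index
    build_request("PUT", "/library/book/_mapping", mapping),  # send the mappings
    build_request("POST", "/library/book/1", document),       # index the document
]

for request in prepared:
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request).read()
```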

If we would like to see how our data was indexed, we can run a query like the following:

curl -XGET 'localhost:9200/library/book/_search?q=*:*&fields=*&pretty=true'

It will return the following data:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 1.0,
      "fields" : {
        "copies" : 0,
        "characters.name" : [ "Raskolnikov", "Sofia" ],
        "englishTitle" : "Crime and Punishment",
        "author.name.lastName" : "Dostoevsky",
        "isbn" : "123456789",
        "originalTitle" : "Преступлéние и наказáние",
        "year" : 1886,
        "author.name.firstName" : "Fyodor"
      }
    } ]
  }
}

As you can see, all the fields from the arrays and object types were indexed properly. Notice, for example, that the author.name.firstName field is present, because ElasticSearch flattened the data.
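A rough Python model of that flattening is shown next. It assumes every leaf value is collected under the dotted path of its enclosing objects, with arrays contributing several values to the same name. The helper flatten is hypothetical and the document is shortened; real ElasticSearch does much more (analysis, type handling) than this sketch suggests.

```python
def flatten(node, prefix="", out=None):
    """Collect every leaf value under its dotted field name."""
    out = {} if out is None else out
    if isinstance(node, dict):
        for key, value in node.items():
            name = (prefix + "." + key) if prefix else key
            flatten(value, name, out)
    elif isinstance(node, list):
        # Array elements share the name of the array itself.
        for item in node:
            flatten(item, prefix, out)
    else:
        out.setdefault(prefix, []).append(node)
    return out


data = {
    "book": {
        "author": {"name": {"firstName": "Fyodor", "lastName": "Dostoevsky"}},
        "englishTitle": "Crime and Punishment",
        "year": 1886,
        "characters": [{"name": "Raskolnikov"}, {"name": "Sofia"}],
        "copies": 0,
    }
}

# The root object (book) names the type, so we flatten what is inside it.
flat = flatten(data["book"])
for name in sorted(flat):
    print(name, flat[name])
```

Every value is kept in a list here, so single-valued and multivalued fields look the same; the search response shown earlier prints single values without the brackets.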