Indexing the Data


Elasticsearch indexing

We have our Elasticsearch cluster up and running, and we also know how to use the Elasticsearch REST API to index our data, delete it, and retrieve it. We also know how to use search to get our documents. If you are used to SQL databases, you might know that before you can start putting the data there, you need to create a structure, which will describe what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you’ll see how to create new indices (and how to delete them). Before we look closer at the available API methods, let’s see what the indexing process looks like.

Shards and replicas

An Elasticsearch index is made up of one or more shards, and each of them contains part of your document set. Each of these shards can also have replicas, which are exact copies of the shard. During index creation, we can specify how many shards and replicas should be created. We can also omit this information and use the default values, either those defined in the global configuration file (elasticsearch.yml) or those built into Elasticsearch. If we rely on the defaults of the Elasticsearch versions described here, our index will end up with five shards and one replica (releases from 7.0 onward default to a single shard). What does that mean? To put it simply, we will end up with 10 Lucene indices distributed among the cluster.

Are you wondering how we did the calculation and got 10 Lucene indices from five shards and one replica? The term “replica” is somewhat misleading. A replica count of one means that every shard gets one copy of its own, so there are five primary shards and five copies, and each of those ten is a separate Lucene index.
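The arithmetic above can be written as a tiny Python sketch. The function name total_lucene_indices is made up for illustration and is not part of any Elasticsearch API; it only restates that each primary shard and each of its copies is one Lucene index.

```python
def total_lucene_indices(number_of_shards, number_of_replicas):
    """Count the Lucene indices behind one Elasticsearch index.

    Hypothetical helper: every primary shard is one Lucene index, and each
    replica adds one more copy of every primary shard.
    """
    return number_of_shards * (1 + number_of_replicas)


# Five shards with one replica each, the case discussed above.
print(total_lucene_indices(5, 1))
```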

Having a shard and its replica, in general, means that when we index a document, we will modify them both. That’s because to have an exact copy of a shard, Elasticsearch needs to inform all the replicas about the change in shard contents. In the case of fetching a document, we can use either the shard or its copy. In a system with many physical nodes, we will be able to place the shards and their copies on different nodes and thus use more processing power (such as disk I/O or CPU). To sum up, the conclusions are as follows:

  • More shards allow us to spread an index across more servers, which means we can handle more documents without losing performance.
  • More shards mean that fewer resources are required to fetch a particular document, because each shard stores fewer documents than it would in a deployment with fewer shards.
  • More shards mean more overhead when searching across the index, because we have to merge results from more shards and thus the aggregation phase of the query can be more resource intensive.
  • Having more replicas results in a more fault-tolerant cluster, because when the original shard is not available, its copy will take the role of the original shard. With a single replica, the cluster may lose one copy of a shard without data loss. With two replicas, we can lose the primary shard and one of its replicas and everything will still work.
  • The more replicas there are, the higher the query throughput will be. That’s because the query can be executed on either a shard or any of its copies.

Of course, these are not the only relationships between the number of shards and replicas in Elasticsearch.

So, how many shards and replicas should we have for our indices? That depends. We believe that the defaults are quite good but nothing can replace a good test. Note that the number of replicas is less important because you can adjust it on a live cluster after index creation. You can remove and add them if you want and have the resources to run them. Unfortunately, this is not true when it comes to the number of shards. Once you have your index created, the only way to change the number of shards is to create another index and reindex your data.
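One way to see why the shard count is fixed: Elasticsearch chooses the shard for a document by hashing its routing value (by default the document ID) and taking the result modulo the number of primary shards. The sketch below imitates that idea in Python. Note the assumptions: zlib.crc32 is only a stand-in for the hash function Elasticsearch really uses, and pick_shard is a hypothetical name.

```python
import zlib


def pick_shard(doc_id, number_of_shards):
    """Map a document ID to a shard number (illustrative only).

    zlib.crc32 stands in for the hash Elasticsearch really uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


doc_ids = [str(i) for i in range(1, 101)]

# The same IDs placed in an index with five shards and in one with six.
with_five = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
with_six = {doc_id: pick_shard(doc_id, 6) for doc_id in doc_ids}

moved = [doc_id for doc_id in doc_ids if with_five[doc_id] != with_six[doc_id]]
print(len(moved), "of", len(doc_ids), "documents would land on a different shard")
```

Because the shard number depends on the shard count, changing that count in place would send lookups for existing documents to the wrong shard, which is why the data has to be reindexed into a new index.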

Creating indices

When we created our first document in Elasticsearch, we didn’t care about index creation at all. We just used the following command:

curl -XPUT http://localhost:9200/blog/article/1 -d '{"title": "New version of Elasticsearch released!", "content": "...", "tags": ["announce", "elasticsearch", "release"] }'

This is fine. If such an index does not exist, Elasticsearch automatically creates the index for us. We can also create the index ourselves by running the following command:

curl -XPUT http://localhost:9200/blog/

We just told Elasticsearch that we want to create an index named blog. If everything goes right, you will see the following response from Elasticsearch:

{"acknowledged":true}

When is manual index creation necessary? There are many situations. One of them can be the inclusion of additional settings such as the index structure or the number of shards.

Altering automatic index creation

Sometimes, you may come to the conclusion that automatic index creation is a bad thing. When you have a big system with many processes sending data into Elasticsearch, a simple typo in an index name can silently send hours of indexing work into the wrong, automatically created index. You can turn off automatic index creation by adding the following line to the elasticsearch.yml configuration file:

action.auto_create_index: false

Note that action.auto_create_index is more flexible than it looks. The value is not limited to true or false. We can also use index name patterns to specify whether an index with a given name can be created automatically if it doesn’t exist. For example, the following definition allows automatic creation of indices whose names begin with a, but disallows it for indices whose names begin with an. All other indices are disallowed as well and must be created manually (because of -*).

action.auto_create_index: -an*,+a*,-*

Note that the order of pattern definitions matters. Elasticsearch checks the patterns up to the first pattern that matches, so if you move -an* to the end, it won’t be used because +a* will be checked first.
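The first-match rule can be modeled in a few lines of Python. This is a simplified sketch, not the actual Elasticsearch implementation: may_auto_create is a hypothetical helper, it handles only comma-separated patterns (not the plain true and false values), and it assumes that creation is refused when no pattern matches.

```python
from fnmatch import fnmatchcase


def may_auto_create(index_name, setting):
    """Decide whether an index may be created automatically.

    Patterns are checked left to right and the first one that matches
    decides. A leading "-" forbids, a leading "+" (or no sign) allows.
    """
    for raw in setting.split(","):
        pattern = raw.strip()
        allowed = not pattern.startswith("-")
        if pattern[:1] in "+-":
            pattern = pattern[1:]
        if fnmatchcase(index_name, pattern):
            return allowed
    # No pattern matched: this sketch assumes creation is refused.
    return False


setting = "-an*,+a*,-*"
for name in ["articles", "answers", "blog"]:
    print(name, may_auto_create(name, setting))
```

Moving -an* behind +a*, as in "+a*,-an*,-*", makes the sketch allow answers as well, which mirrors the ordering effect described above.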

Settings for a newly created index

The manual creation of an index is also necessary when you want to set some configuration options, such as the number of shards and replicas. Let’s look at the following example:

curl -XPUT http://localhost:9200/blog/ -d '{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 2
  }
}'

The preceding command will result in the creation of the blog index with one shard and two replicas, so it makes a total of three physical Lucene indices. Also, there are other values that can be set in this way.
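For readers who prefer a script to curl, the same request can be assembled with the Python standard library. This is a minimal sketch under a few assumptions: build_create_index_request is a hypothetical helper name, the host and port are the ones used in the curl example, and the Content-Type header is an addition that does not appear in the curl command (recent Elasticsearch releases expect it). The request is only built here, not sent.

```python
import json
import urllib.request


def build_create_index_request(index, shards, replicas,
                               base_url="http://localhost:9200"):
    """Build (but do not send) a PUT request that creates an index."""
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request("blog", shards=1, replicas=2)
print(request.get_method(), request.full_url)
# To send it to a running cluster: urllib.request.urlopen(request)
```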

So, we already have our new, shiny index. But there is a problem; we forgot to provide the mappings, which are responsible for describing the index structure. What can we do? Since we have no data at all, we’ll go for the simplest approach – we will just delete the index. To do that, we will run a command similar to the preceding one, but instead of using the PUT HTTP method, we use DELETE. So the actual command is as follows:

curl -XDELETE http://localhost:9200/blog

And the response will be the same as the one we saw earlier, as follows:

{"acknowledged":true}

Now that we know what an index is, how to create it, and how to delete it, we are ready to create indices with the mappings we have defined. This is a very important part, because the way data is indexed affects the search process and the way in which documents are matched.

Mappings configuration

If you are used to SQL databases, you may know that before you can start inserting the data in the database, you need to create a schema, which will describe what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure and thus defining it ourselves is a better way. In the following few pages, you’ll see how to create new indices (and how to delete them) and how to create mappings that suit your needs and match your data structure.

Type determining mechanism

Before we start describing how to create mappings manually, there is one thing worth mentioning. Elasticsearch can guess the document structure by looking at the JSON that defines the document. In JSON, strings are surrounded by quotation marks, Booleans are written using the specific words true and false, and numbers are plain digits without quotation marks. This is a simple trick, but it usually works. For example, let’s look at the following document:

{ "field1": 10, "field2": "10" }

The preceding document has two fields. The field1 field will be determined as a number (to be precise, as long type), but field2 will be determined as a string, because it is surrounded by quotation marks. Of course, this can be the desired behavior, but sometimes the data source may omit the information about the data type and everything may be present as strings. The solution to this is to enable more aggressive text checking in the mapping definition by setting the numeric_detection property to true. For example, we can execute the following command during the creation of the index:

curl -XPUT 'http://localhost:9200/blog/?pretty' -d '{
  "mappings" : {
    "article" : {
      "numeric_detection" : true
    }
  }
}'

Unfortunately, the problem still exists if we want the Boolean type to be guessed. There is no option to force the guessing of Boolean types from the text. In such cases, when a change of source format is impossible, we can only define the field directly in the mappings definition.
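The guessing rules described in this section can be imitated with a short Python function. It is a simplified sketch, not Elasticsearch code: guess_type is a hypothetical name, the "double" label for floating-point values is an assumption, and objects, arrays, and null values are left out.

```python
import json


def guess_type(value, numeric_detection=False):
    """Guess a field type from a parsed JSON value (simplified sketch)."""
    # bool must be checked before int: in Python, True is also an int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        if numeric_detection:
            # Try the stricter integer parse first, then floating point.
            for parse, label in ((int, "long"), (float, "double")):
                try:
                    parse(value)
                    return label
                except ValueError:
                    pass
        return "string"
    raise TypeError("objects, arrays, and null are outside this sketch")


document = json.loads('{"field1": 10, "field2": "10", "field3": "true"}')
for name, value in document.items():
    print(name, guess_type(value), guess_type(value, numeric_detection=True))
```

In this sketch, as in the text above, turning on numeric detection changes the result for "10" but not for "true", which stays a string.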

Another type that causes trouble is a date-based one. Elasticsearch tries to guess dates given as timestamps or strings that match the date format. We can define the list of recognized date formats using the dynamic_date_formats property, which allows us to specify the formats array. Let’s look at the following command for creating the index and type:

curl -XPUT 'http://localhost:9200/blog/' -d '{
  "mappings" : {
    "article" : {
      "dynamic_date_formats" : ["yyyy-MM-dd HH:mm"]
    }
  }
}'

The preceding command will result in the creation of an index called blog with the single type called article. We’ve also used the dynamic_date_formats property with a single date format that will result in Elasticsearch using the date core type for fields matching the defined format. Elasticsearch uses the joda-time library to define date formats, so please visit http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html if you are interested in finding out more about them.

Remember that the dynamic_date_formats property accepts an array of values. That means that we can handle several date formats simultaneously.
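Checking a string against several formats in turn can be sketched in Python as follows. The assumptions are worth stating: looks_like_date is a hypothetical helper, Python's strptime patterns are hand-written stand-ins for Joda-style patterns (a 24-hour year-month-day hour:minute format like the one above), and the second format is an extra added only to show that several can be listed.

```python
from datetime import datetime

# strptime equivalents of Joda-style patterns, translated by hand.
# The second entry is an illustrative extra format.
DYNAMIC_DATE_FORMATS = ["%Y-%m-%d %H:%M", "%d/%m/%Y"]


def looks_like_date(text, formats=DYNAMIC_DATE_FORMATS):
    """Return True if the string matches any of the configured formats."""
    for fmt in formats:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


for sample in ["2014-01-31 13:45", "31/01/2014", "not a date"]:
    print(sample, looks_like_date(sample))
```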
