

Altering Apache Lucene scoring

With the release of Apache Lucene 4.0 in 2012, all users of this great, full text search library were given the opportunity to alter the default TF/IDF-based algorithm. The Lucene API was changed to allow easier modification and extension of the scoring formula. However, that was not the only change made to Lucene when it comes to document score calculation. Lucene 4.0 was shipped with additional similarity models, which allow us to use a different scoring formula for our documents. In this section, we will take a deeper look at what Lucene 4.0 brings and how those features were incorporated into ElasticSearch.

Setting per-field similarity

Since ElasticSearch 0.90, we are allowed to set a different similarity for each of the fields we have in our mappings. For example, let's assume that we have the following simple mapping, which we use to index blog posts (stored in the posts_no_similarity.json file):

{ "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" } } } } }

What we would like to do is use the BM25 similarity model for the name field and the contents field. In order to do that, we need to extend our field definitions and add the similarity property with the value of the chosen similarity name. Our changed mappings (stored in the posts_similarity.json file) would appear as shown in the following code:

{ "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "BM25" } } } } }

And that's all; nothing more is needed. After the preceding change, Apache Lucene will use the BM25 similarity to calculate the score factor for the name and contents fields.

In the case of the Divergence from randomness and Information-based similarity models, we need to configure some additional properties to specify the behavior of those similarities. How to do that is covered in the next part of the current section.
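
As a quick preview, a custom Divergence from randomness similarity could be declared in the index settings and assigned to a field along the lines of the following sketch; the similarity name custom_dfr and the chosen parameter values are only illustrative assumptions:

{
  "settings" : {
    "index" : {
      "similarity" : {
        "custom_dfr" : {
          "type" : "DFR",
          "basic_model" : "g",
          "after_effect" : "l",
          "normalization" : "h2",
          "normalization.h2.c" : "2.0"
        }
      }
    }
  },
  "mappings" : {
    "post" : {
      "properties" : {
        "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "custom_dfr" }
      }
    }
  }
}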

Default codec properties

When using the default codec, we are allowed to configure the following properties (see the configuration sketch after this list):

  • min_block_size: It specifies the minimum block size the Lucene term dictionary uses to encode blocks. It defaults to 25.
  • max_block_size: It specifies the maximum block size the Lucene term dictionary uses to encode blocks. It defaults to 48.
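
As a sketch of how these properties can be set, we could define our own postings format based on the default codec in the index settings and point a field at it, following the same pattern as the bloom filter example later in this section; the postings format name custom_default and the values used here are only examples:

{
  "settings" : {
    "index" : {
      "codec" : {
        "postings_format" : {
          "custom_default" : {
            "type" : "default",
            "min_block_size" : "20",
            "max_block_size" : "60"
          }
        }
      }
    }
  },
  "mappings" : {
    "post" : {
      "properties" : {
        "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "postings_format" : "custom_default" }
      }
    }
  }
}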

Direct codec properties

The direct codec allows us to configure the following properties:

  • min_skip_count: It specifies the minimum number of terms with a shared prefix to allow writing of a skip pointer. It defaults to 8.
  • low_freq_cutoff: The codec will use a single array object to hold postings and positions for terms that have a document frequency lower than this value. It defaults to 32.

Memory codec properties

By using the memory codec, we are allowed to alter the following properties (see the sketch after this list):

  • pack_fst: It is a Boolean option that defaults to false and specifies whether the memory structure that holds the postings should be packed into the FST. Packing into the FST will reduce the memory needed to hold the data.
  • acceptable_overhead_ratio: It is the compression ratio of the internal structure, specified as a float value, which defaults to 0.2. When using the value 0, there will be no additional memory overhead, but the returned implementation may be slow. When using the value 0.5, there can be a 50 percent memory overhead, but the implementation will be fast. Values higher than 1 are also possible, but may result in high memory overhead.
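
As a hedged illustration of these two properties, a memory-based postings format could be declared in the index settings as follows; the name custom_memory and the values are only examples:

{
  "settings" : {
    "index" : {
      "codec" : {
        "postings_format" : {
          "custom_memory" : {
            "type" : "memory",
            "pack_fst" : true,
            "acceptable_overhead_ratio" : 0.4
          }
        }
      }
    }
  }
}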

Pulsing codec properties

When using the pulsing codec, we are allowed to use the same properties as with the default codec and, in addition to them, one more property, which is described as follows:

  • freq_cut_off: It specifies the document frequency at which the postings list will be written into the term dictionary; terms with a document frequency equal to or lower than the value of freq_cut_off will be processed this way. It defaults to 1.

Bloom filter-based codec properties

If we want to configure a bloom filter-based codec, we can use the bloom_filter type and set the following properties:

  • delegate: It specifies the name of the codec we want to wrap with the bloom filter.
  • fpp: It is a value between 0 and 1.0, which specifies the desired false positive probability. We are allowed to set multiple probabilities depending on the number of documents per Lucene segment. For example, the default value of 10k=0.01, 1m=0.03 specifies that an fpp of 0.01 will be used when the number of documents per segment is larger than 10,000, and a value of 0.03 will be used when the number of documents per segment is larger than one million.

For example, we could configure our custom bloom filter-based codec to wrap the direct postings format as shown in the following code (stored in the posts_bloom_custom.json file):

{ "settings" : { "index" : { "codec" : { "postings_format" : { "custom_bloom" : { "type" : "bloom_filter", "delegate" : "direct", "ffp" : "10k=0.03, 1m=0.05" } } } } }, "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "postings_format" : "custom_bloom" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" } } } } }

NRT, flush, refresh, and transaction log

In an ideal search solution, when new data is indexed, it is instantly available for searching. At first glance, this is exactly how ElasticSearch works, even in multi-server environments. But this is not the truth (or at least not the whole truth), and we will show you why. Let's index an example document to a newly created index by using the following command:

curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test" }'

Now we will replace this document and immediately try to find it. In order to do this, we'll use the following command chain:

curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test2" }' ; curl localhost:9200/test/test/_search?pretty

The preceding command chain will probably result in a response very similar to the following one:

{"ok":true,"_index":"test","_type":"test","_id":"1","_version":2}{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "test", "_id" : "1", "_score" : 1.0, "_source" : { "title": "test" } } ] } }

The first line is the response to the indexing command, the first of the two commands. As you can see, everything went well, so the search query should return the document with the title field set to test2; however, it returned the first version of the document instead. What happened?

But before we give you the answer to this question, we should take a step backward and discuss how the underlying Apache Lucene library makes newly indexed documents available for searching.

Updating index and committing changes

Segments are independent indices, which means that for queries run in parallel with indexing to see new documents, newly created segments should, from time to time, be added to the set of segments used for searching. Apache Lucene does this by creating subsequent (because of the write-once nature of the index) segments_N files, which list the segments present in the index. This process is called committing. Lucene can perform it in a secure way: either all the changes hit the index, or none of them do. If a failure happens, we can be sure that the index will be left in a consistent state.

Let's return to our example. The first operation adds the document to the index, but doesn't run the commit command to Lucene. This is exactly how it works. However, a commit alone is not enough for the data to be available for searching. The Lucene library uses an abstraction called Searcher to access the index. After a commit operation, the Searcher object should be reopened in order to see the newly created segments. This whole process is called refresh. For performance reasons, ElasticSearch tries to postpone costly refreshes: by default, a refresh is not performed after indexing a single document (or a batch of them); instead, the Searcher is refreshed every second. This is already quite frequent, but sometimes applications require the refresh operation to be performed more often than once every second. When this happens, you can consider using another technology, or the requirements should be verified. If required, it is possible to force a refresh by using the ElasticSearch API. For example, in our case, we can add the following command:

curl -XGET localhost:9200/test/_refresh

If we run the preceding command before the search query, ElasticSearch will respond as we expected.

Changing the default refresh time

The time between automatic Searcher refreshes can be changed by using the index.refresh_interval parameter, either in the ElasticSearch configuration file or by using the update settings API. For example:

curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "5m" } }'

The preceding command will change the automatic refresh to be done every 5 minutes. Please remember that data indexed between refreshes won't be visible to queries.

As we said, the refresh operation is costly when it comes to resources. The longer the refresh interval, the faster your indexing will be. If you are planning a very intensive indexing procedure and you don't need the data to be visible until the indexing ends, you can consider disabling the refresh operation by setting the index.refresh_interval parameter to -1 and setting it back to its original value after the indexing is done.
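
For example, such a sequence could look like the following sketch (assuming our test index and that the original interval was the default of one second):

curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "-1" } }'

Then, after the indexing is done:

curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "1s" } }'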

The transaction log

Apache Lucene can guarantee index consistency and all-or-nothing indexing, which is great. But this alone cannot ensure us that there will be no data loss when a failure happens while writing data to the index (for example, when there isn't enough space on the device, the device is faulty, or there aren't enough file handles available to create new index files). Another problem is that frequent commits are costly in terms of performance (as you recall, a single commit will trigger a new segment creation, and this can trigger segment merges). ElasticSearch solves those issues by implementing the transaction log. The transaction log holds all uncommitted transactions and, from time to time, ElasticSearch creates a new log for subsequent changes. When something goes wrong, the transaction log can be replayed to make sure that none of the changes were lost. All of these tasks happen automatically, so the user may not be aware that a commit was triggered at a particular moment. In ElasticSearch, the moment when the information from the transaction log is synchronized with the storage (which is the Apache Lucene index) and the transaction log is cleared is called flushing.

Please note the difference between the flush and refresh operations. In most cases, refresh is exactly what you want: it is all about making new data available for searching. The flush operation, on the other hand, is used to make sure that all the data is correctly stored in the index and that the transaction log can be cleared.

In addition to automatic flushing, a flush can be forced manually using the flush API. For example, we can flush the data stored in the transaction logs of all indices by running the following command:

curl -XGET localhost:9200/_flush

Or we can run the flush command for a particular index, which in our case is the one called library:

curl -XGET localhost:9200/library/_flush
curl -XGET localhost:9200/library/_refresh

In the second example, we used flush together with refresh, which, after the data is flushed, opens a new Searcher.

The transaction log configuration

If the default behavior of the transaction log is not enough, ElasticSearch allows us to configure how the transaction log is handled. The following parameters can be set in the elasticsearch.yml file, as well as by using the index settings update API, to control the transaction log behavior:

  • index.translog.flush_threshold_period: It defaults to 30 minutes (30m). It controls the time after which a flush will be forced automatically, even if no new data was written to the transaction log. In some cases this can cause a lot of I/O operations, so sometimes it's better to flush more often, with less data stored in the log.
  • index.translog.flush_threshold_ops: It specifies the maximum number of operations after which the flush operation will be performed. It defaults to 5000.
  • index.translog.flush_threshold_size: It specifies the maximum size of the transaction log. If the size of the transaction log is equal to or greater than this value, the flush operation will be performed. It defaults to 200 MB.
  • index.translog.disable_flush: This option disables automatic flushing. By default, flushing is enabled, but sometimes it is handy to disable it temporarily, for example, during the import of a large amount of documents.

All of the mentioned parameters are specified for an index of our choice, but they define the behavior of the transaction log for each of that index's shards.
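
For instance, a minimal sketch of such defaults placed in the elasticsearch.yml file could look as follows; the values here are only examples:

index.translog.flush_threshold_period: 10m
index.translog.flush_threshold_ops: 10000
index.translog.flush_threshold_size: 500mb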

Of course, in addition to setting the preceding parameters in the elasticsearch.yml file, they can also be set by using the Settings Update API. For example:

curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "translog.disable_flush" : true } }'

We ran the preceding command before importing a large amount of data, which gave us a performance boost for indexing. However, one should remember to turn flushing back on when the import is done.
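
For example, flushing can be turned back on with a command similar to the following one:

curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "translog.disable_flush" : false } }'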

Near Real Time GET

The transaction log gives us one more feature for free, namely the real-time GET operation, which provides the possibility of returning the most recent version of a document, including versions that have not yet been committed. The real-time GET operation fetches data from the index, but first it checks whether a newer version of the document is available in the transaction log. If the document has not been flushed yet, the data from the index is ignored and the newer version of the document, the one from the transaction log, is returned. In order to see how it works, you can replace the search operation in our example with the following command:

curl -XGET localhost:9200/test/test/1?pretty

ElasticSearch should return a result similar to the following:

{ "_index" : "test", "_type" : "test", "_id" : "1", "_version" : 2, "exists" : true, "_source" : { "title": "test2" } }

If you look at the result, you can see that, again, it is just as we expected, and no trick with refresh was required to obtain the newest version of the document.
