
This article is by Rafał Kuć and Marek Rogozinski, the authors of the book Elasticsearch Server Third Edition. Previously, we dived into Elasticsearch indexing and learned a lot about data handling. We saw how to tune the Elasticsearch schema-less mechanism, and we now know how to create our own mappings. We also saw the core types of Elasticsearch and used analyzers, both the ones that come out of the box with Elasticsearch and the ones we define ourselves. We used bulk indexing, and we added additional internal information to our indices. Finally, we learned what segment merging is and how to fine-tune it, as well as how to use routing in Elasticsearch and what it gives us. This article is fully dedicated to querying. By the end of this article, you will have learned the following topics:

  • How to query Elasticsearch
  • Using the script fields
  • Understanding the querying process


Querying Elasticsearch

So far, when we searched our data, we used the REST API and a simple query sent with a GET request. Similarly, when we were changing the index, we also used the REST API and sent JSON-structured data to Elasticsearch. Regardless of the type of operation we wanted to perform, whether it was a mapping change or document indexing, we used a JSON-structured request body to inform Elasticsearch about the operation details. The same happens when we want to send more than a simple query to Elasticsearch: we structure it using JSON objects and send it in the request body. This is called the query DSL. In a broader view, Elasticsearch supports two kinds of queries: basic ones and compound ones. Basic queries, such as the term query, are used for querying the actual data. Compound queries, such as the bool query, can combine multiple queries. However, this is not the whole picture. In addition to these two types of queries, certain queries can have filters that are used to narrow down your results with certain criteria. Filters don’t affect scoring and are usually very efficient and easily cached.

To make it even more complicated, queries can contain other queries (don’t worry; we will try to explain all this!). Furthermore, some queries can contain filters and others can contain both queries and filters. Although this is not everything, we will stick with this working explanation for now.

The example data

If not stated otherwise, the following mappings will be used for the rest of the article:

{
  "book" : {
    "properties" : {
      "author" : {
        "type" : "string"
      },
      "characters" : {
        "type" : "string"
      },
      "copies" : {
        "type" : "long",
        "ignore_malformed" : false
      },
      "otitle" : {
        "type" : "string"
      },
      "tags" : {
        "type" : "string",
        "index" : "not_analyzed"
      },
      "title" : {
        "type" : "string"
      },
      "year" : {
        "type" : "long",
        "ignore_malformed" : false,
        "index" : "analyzed"
      },
      "available" : {
        "type" : "boolean"
      }
    }
  }
}

The preceding mappings represent a simple library and were used to create the library index. One thing to remember is that Elasticsearch will analyze the string-based fields if we don’t configure it differently.

The preceding mappings were stored in the mapping.json file and in order to create the mentioned library index we can use the following commands:

curl -XPOST 'localhost:9200/library'
curl -XPUT 'localhost:9200/library/book/_mapping' -d @mapping.json

We also used the following sample data as the example documents for this article:

{ "index": {"_index": "library", "_type": "book", "_id": "1"}}
{ "title": "All Quiet on the Western Front","otitle": "Im Westen nichts Neues","author": "Erich Maria Remarque","year": 1929,"characters": ["Paul Bäumer", "Albert Kropp", "Haie Westhus", "Fredrich Müller", "Stanislaus Katczinsky", "Tjaden"],"tags": ["novel"],"copies": 1, "available": true, "section" : 3}
{ "index": {"_index": "library", "_type": "book", "_id": "2"}}
{ "title": "Catch-22","author": "Joseph Heller","year": 1961,"characters": ["John Yossarian", "Captain Aardvark", "Chaplain Tappman", "Colonel Cathcart", "Doctor Daneeka"],"tags": ["novel"],"copies": 6, "available" : false, "section" : 1}
{ "index": {"_index": "library", "_type": "book", "_id": "3"}}
{ "title": "The Complete Sherlock Holmes","author": "Arthur Conan Doyle","year": 1936,"characters": ["Sherlock Holmes","Dr. Watson", "G. Lestrade"],"tags": [],"copies": 0, "available" : false, "section" : 12}
{ "index": {"_index": "library", "_type": "book", "_id": "4"}}
{ "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true}

We stored our sample data in the documents.json file, and we use the following command to index it:

curl -s -XPOST 'localhost:9200/_bulk' --data-binary @documents.json

A simple query

The simplest way to query Elasticsearch is to use the URI request query. For example, to search for the word crime in the title field, you could send a query using the following command:

curl -XGET 'localhost:9200/library/book/_search?q=title:crime&pretty'

This is a very simple, but limited, way of submitting queries to Elasticsearch. If we look from the point of view of the Elasticsearch query DSL, the preceding query is the query_string query. It searches for the documents that have the term crime in the title field and can be rewritten as follows:

{
  "query" : { 
    "query_string" : { "query" : "title:crime" }
  }
}

Sending a query using the query DSL is a bit different, but still not rocket science. We send an HTTP GET request to the _search REST endpoint as earlier and include the query in the request body (POST is also accepted in case your tool or library doesn’t allow sending a request body in HTTP GET requests). Let’s take a look at the following command:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

As you can see, we used the request body (the -d switch) to send the whole JSON-structured query to Elasticsearch. The pretty request parameter tells Elasticsearch to structure the response in such a way that we humans can read it more easily. In response to the preceding command, we get the following output:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5,
      "_source" : {
        "title" : "Crime and Punishment",
        "otitle" : "Преступлéние и наказáние",
        "author" : "Fyodor Dostoevsky",
        "year" : 1886,
        "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ],
        "tags" : [ ],
        "copies" : 0,
        "available" : true
      }
    } ]
  }
}

Nice! We got our first search results with the query DSL.
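For readers who prefer to script the request rather than type curl commands, the following is a minimal Python sketch of the same query DSL call, using only the standard library. The helper names build_search_request and hit_titles are hypothetical and chosen for illustration. The request is only constructed here; actually sending it assumes an Elasticsearch node listening on localhost:9200, as in the curl examples.

```python
import json
import urllib.request


def build_search_request(query, index="library", doc_type="book",
                         host="localhost:9200"):
    """Build (but do not send) a _search request carrying a query DSL body."""
    url = "http://%s/%s/%s/_search?pretty" % (host, index, doc_type)
    body = json.dumps({"query": query}).encode("utf-8")
    # POST is used because some HTTP clients refuse to send a body with GET.
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"})


def hit_titles(response):
    """Extract the title of every hit from a parsed search response."""
    return [hit["_source"]["title"] for hit in response["hits"]["hits"]]


request = build_search_request({"query_string": {"query": "title:crime"}})
# To run it against a live node (assumed to be available):
#   with urllib.request.urlopen(request) as resp:
#       print(hit_titles(json.load(resp)))
```

The response shape that hit_titles expects is the one shown in the preceding output: a hits object containing a hits array, with each entry carrying its _source.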

Paging and result size

Elasticsearch allows us to control how many results we want to get (at most) and from which result we want to start. The following are the two additional properties that can be set in the request body:

  • from: This property specifies the offset of the first document that we want in our results. Its default value is 0, which means that we want to get our results from the first document.
  • size: This property specifies the maximum number of documents we want as the result of a single query (which defaults to 10). For example, if we are only interested in aggregation results and don’t care about the documents returned by the query, we can set this parameter to 0.

If we want our query to get documents starting from the tenth item on the list and get 20 items from there on, we send the following query (note that from is zero-based, so the tenth item has the offset 9):

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "from" :  9,
  "size" : 20,
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'
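Because from is a zero-based offset, applications that think in pages usually need a small conversion. The following Python sketch shows one way to do it; the page_window helper and its 1-based page numbering are assumptions made for illustration, not something Elasticsearch defines.

```python
def page_window(page, page_size):
    """Translate a 1-based page number into from/size values."""
    if page < 1 or page_size < 0:
        raise ValueError("page must be >= 1 and page_size must be >= 0")
    # from is zero-based: the first document has offset 0.
    return {"from": (page - 1) * page_size, "size": page_size}


# Third page of 20 results each, combined with the query used in this section.
body = page_window(page=3, page_size=20)
body["query"] = {"query_string": {"query": "title:crime"}}
```

The resulting dictionary can be serialized to JSON and sent as the request body in the same way as the preceding curl command.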

Returning the version value

In addition to all the information returned, Elasticsearch can return the version of the document. To do this, we need to add the version property with the value of true to the top level of our JSON object. So, the final query, which requests version information, will look as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
     "version" : true,
     "query" : {
       "query_string" : { "query" : "title:crime" }
     }
}'

After running the preceding query, we get the following results:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_version" : 1,
      "_score" : 0.5,
      "_source" : {
        "title" : "Crime and Punishment",
        "otitle" : "Преступлéние и наказáние",
        "author" : "Fyodor Dostoevsky",
        "year" : 1886,
        "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ],
        "tags" : [ ],
        "copies" : 0,
        "available" : true
      }
    } ]
  }
}

As you can see, the _version section is present for the single hit we got.

Limiting the score

For nonstandard use cases, Elasticsearch provides a feature that lets us filter the results on the basis of a minimum score value that the document must have to be considered a match. In order to use this feature, we must provide the min_score property at the top level of our JSON object with the value of the minimum score. For example, if we want our query to only return documents with a score of at least 0.75, we send the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "min_score" : 0.75,
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

We get the following response after running the preceding query:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

If you look at the previous examples, the score of our document was 0.5, which is lower than 0.75, and thus we didn’t get any documents in response.

Limiting the score usually doesn’t make much sense because comparing scores between the queries is quite hard. However, maybe in your case, this functionality will be needed.

Choosing the fields that we want to return

With the use of the fields array in the request body, Elasticsearch allows us to define which fields to include in the response. Remember that you can only return these fields if they are marked as stored in the mappings used to create the index, or if the _source field was used (Elasticsearch uses the _source field to provide the stored values and the _source field is turned on by default).

So, for example, to return only the title and the year fields in the results (for each document), send the following query to Elasticsearch:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "fields" : [ "title", "year" ],
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

In response, we get the following output:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5,
      "fields" : {
        "title" : [ "Crime and Punishment" ],
        "year" : [ 1886 ]
      }
    } ]
  }
}

As you can see, everything worked as we wanted. There are four things we would like to share with you, which are as follows:

  • If we don’t define the fields array, it will use the default value and return the _source field if available.
  • If we use the _source field and request a field that is not stored, then that field will be extracted from the _source field (however, this requires additional processing).
  • If we want to return all the stored fields, we just pass an asterisk (*) as the field name.
  • From a performance point of view, it’s better to return the _source field instead of multiple stored fields. This is because getting multiple stored fields may be slower compared to retrieving a single _source field.

Source filtering

In addition to choosing which fields are returned, Elasticsearch allows us to use so-called source filtering. This functionality allows us to control which fields are returned from the _source field. Elasticsearch exposes several ways to do this. The simplest form of source filtering allows us to decide whether the document source should be returned at all. Consider the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "_source" : false,
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

The result returned by Elasticsearch should be similar to the following one:

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5
    } ]
  }
}

Note that the response is limited to the base information about a document and the _source field was not included. If you use Elasticsearch as a secondary source of data and the content of the document is served from an SQL database or a cache, the document identifier is all you need.

The second way is similar to the fields array described earlier, although here we define which fields should be returned in the document source itself. Let’s see that using the following example query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "_source" : ["title", "otitle"],
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

We wanted to get the title and the otitle document fields in the returned _source field. Elasticsearch extracted those values from the original _source value and included the _source field only with the requested fields. The whole response returned by Elasticsearch looked as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5,
      "_source" : {
        "otitle" : "Преступлéние и наказáние",
        "title" : "Crime and Punishment"
      }
    } ]
  }
}

We can also use an asterisk to select which fields should be returned in the _source field; for example, title* will return the value of the title field and of title10 (if we have such a field in our data). If we have a more complex document with nested parts, we can use dot notation; for example, title.* selects all the fields nested under the title object.

Finally, we can also specify explicitly which fields we want to include and which to exclude from the _source field. We can include fields using the include property and we can exclude fields using the exclude property (both of them are arrays of values). For example, if we want the returned _source field to include all the fields starting with the letter t but not the title field, we will run the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "_source" : { 
    "include" : [ "t*"], 
    "exclude" : ["title"] 
  },
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'
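To make the include and exclude semantics concrete, the following Python sketch emulates them on the top-level field names of a single document. It is only an approximation written for illustration: the filter_source function is hypothetical, it uses the standard fnmatch module for the wildcard matching, and it does not handle nested objects or dot notation the way Elasticsearch does.

```python
from fnmatch import fnmatchcase


def filter_source(source, include=None, exclude=None):
    """Approximate _source include/exclude on top-level field names."""
    include = include or ["*"]   # no include list means "everything"
    exclude = exclude or []
    kept = {}
    for name, value in source.items():
        included = any(fnmatchcase(name, pattern) for pattern in include)
        excluded = any(fnmatchcase(name, pattern) for pattern in exclude)
        # A field survives only if it is included and not excluded.
        if included and not excluded:
            kept[name] = value
    return kept


doc = {
    "title": "All Quiet on the Western Front",
    "tags": ["novel"],
    "year": 1929,
    "copies": 1,
}
filtered = filter_source(doc, include=["t*"], exclude=["title"])
```

With the patterns from the preceding query, title and tags both match t*, but title is then removed by the exclude list, so only tags remains.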

Using the script fields

Elasticsearch allows us to use script-evaluated values that will be returned with the result documents. To use the script fields functionality, we add the script_fields section to our JSON query object and an object with a name of our choice for each scripted value that we want to return. For example, to return a value named correctYear, which is calculated as the year field minus 1800, we run the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "script_fields" : {
    "correctYear" : {
      "script" : "doc[\"year\"].value - 1800"
    } 
  }, 
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

By default, Elasticsearch doesn’t allow us to use dynamic scripting. If you tried the preceding query, you probably got an error with information stating that the scripts of type [inline] with operation [search] and language [groovy] are disabled. To make this example work, you should add the script.inline: on property to the elasticsearch.yml file. However, this exposes a security threat.

Using the doc notation, like we did in the preceding example, allows us to cache the returned values and speed up script execution at the cost of higher memory consumption. It also limits us to single-valued and single-term fields. If we care about memory usage, or if we are using more complicated field values, we can always use the _source field. The same query using the _source field looks as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "script_fields" : {
    "correctYear" : {
      "script" : "_source.year - 1800"
    } 
  }, 
  "query" : {
    "query_string" : { "query" : "title:crime" }
  } 
}'

The following response is returned by Elasticsearch with dynamic scripting enabled:

{
  "took" : 76,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5,
      "fields" : {
        "correctYear" : [ 86 ]
      }
    } ]
  }
}

As you can see, we got the calculated correctYear field in response.

Passing parameters to the script fields

Let’s take a look at one more feature of the script fields: passing additional parameters. Instead of having the value 1800 in the equation, we can use a variable name and pass its value in the params section. If we do this, our query will look as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "script_fields" : {
    "correctYear" : {
      "script" : "_source.year - paramYear",
      "params" : {
        "paramYear" : 1800
      }
    } 
  }, 
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

As you can see, we added the paramYear variable as part of the scripted equation and provided its value in the params section. This allows Elasticsearch to execute the same script with different parameter values in a slightly more efficient way.
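The benefit of parameters is easier to see when several requests are generated from one template. The following Python sketch builds request bodies in which the script text never changes and only the params section varies; the script_field_body function is a hypothetical helper, and the bodies are only printed, not sent.

```python
import json

# The script text is identical for every request; only its parameters differ.
SCRIPT = "_source.year - paramYear"


def script_field_body(param_year, query_text="title:crime"):
    """Build a search body with a parameterized correctYear script field."""
    return {
        "script_fields": {
            "correctYear": {
                "script": SCRIPT,
                "params": {"paramYear": param_year},
            }
        },
        "query": {"query_string": {"query": query_text}},
    }


for base_year in (1800, 1900):
    print(json.dumps(script_field_body(base_year)))
```

Hardcoding 1800 and 1900 into the script would instead produce two different script strings, each of which would have to be handled as a separate script.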

Understanding the querying process

After reading the previous section, we now know how querying works in Elasticsearch. You know that Elasticsearch, in most cases, needs to scatter the query across multiple nodes, get the results, merge them, fetch the relevant documents from one or more shards, and return the final results to the client requesting the documents. What we didn’t talk about are two additional things that define how queries behave: search type and query execution preference. We will now concentrate on these functionalities of Elasticsearch.

Query logic

Elasticsearch is a distributed search engine, and so all the functionality it provides must be distributed by nature. It is exactly the same with querying. Because we would like to discuss some more advanced topics on how to control the query process, we first need to know how it works.

Let’s now get back to how querying works. By default, if we don’t alter anything, the query process will consist of two phases: the scatter phase and the gather phase. The aggregator node (the one that receives the request) will run the scatter phase first. During that phase, the query is distributed to all the shards that our index is built of (unless routing is used). For example, if the index is built of 5 shards and 1 replica, then 5 physical shards will be queried (we don’t need to query both a shard and its replica, as they contain the same data). Each of the queried shards will only return the document identifier and the score of the document. The node that sent the scatter query will wait for all the shards to complete their task, gather the results, and sort them appropriately (in this case, from the top scoring to the lowest scoring ones).

After that, a new request will be sent to build the search results, but this time only to those shards that hold the documents needed to build the response. In most cases, Elasticsearch won’t send the request to all the shards but only to a subset of them. That’s because we usually don’t get the complete result of the query but only a portion of it. This phase is called the gather phase. After all the documents are gathered, the final response is built and returned as the query result. This is the basic and default Elasticsearch behavior, but we can change it.
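The two phases can be illustrated with a toy, in-memory simulation. The following Python sketch is not how Elasticsearch is implemented; the SHARDS dictionary, the scores, and the scatter and gather functions are all made up for illustration, and paging is simplified to from equal to 0.

```python
import heapq

# Hypothetical shard contents: shard id -> {document id: (score, source)}.
SHARDS = {
    0: {"1": (0.2, {"title": "All Quiet on the Western Front"}),
        "4": (0.5, {"title": "Crime and Punishment"})},
    1: {"2": (0.9, {"title": "Catch-22"})},
    2: {"3": (0.1, {"title": "The Complete Sherlock Holmes"})},
}


def scatter(size):
    """Phase 1: every shard reports only (score, doc id, shard id)."""
    candidates = []
    for shard_id, docs in SHARDS.items():
        local = [(score, doc_id, shard_id)
                 for doc_id, (score, _source) in docs.items()]
        # Each shard sends back at most `size` of its best entries.
        candidates.extend(heapq.nlargest(size, local))
    # The receiving node merges and sorts, highest score first.
    return heapq.nlargest(size, candidates)


def gather(top):
    """Phase 2: fetch full documents only from shards that own a winner."""
    contacted = sorted({shard_id for _score, _doc_id, shard_id in top})
    hits = [{"_id": doc_id, "_score": score,
             "_source": SHARDS[shard_id][doc_id][1]}
            for score, doc_id, shard_id in top]
    return contacted, hits


contacted, hits = gather(scatter(size=2))
```

In this toy run, all three shards take part in the scatter phase, but only shards 0 and 1 are contacted in the gather phase, because shard 2 holds none of the two top-scoring documents.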

Search type

Elasticsearch allows us to choose how we want our query to be processed internally. We can do that by specifying the search type. Different search types are appropriate in different situations: sometimes we care only about performance, while at other times query relevance is the most important factor. You should remember that each shard is a small Lucene index, and in order to return more relevant results, some information, such as term frequencies, needs to be transferred between the shards. To control how the queries are executed, we can pass the search_type request parameter and set it to one of the following values:

  • query_then_fetch: In the first step, the query is executed to get the information needed to sort and rank the documents. This step is executed against all the shards. Then only the relevant shards are queried for the actual content of the documents. Unlike with the query_and_fetch search type, the maximum number of results returned by this search type will be equal to the size parameter. This is the search type used by default if no search type is provided with the query, and this is the one we described previously.
  • dfs_query_then_fetch: This is similar to query_then_fetch. However, compared to it, dfs_query_then_fetch contains an additional phase, which calculates distributed term frequencies.
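To see why distributed term frequencies can matter, consider a deliberately simplified calculation. The following Python sketch uses a textbook-style inverse document frequency, not the exact formula used by Lucene, and the per-shard numbers in SHARD_STATS are invented for illustration.

```python
import math

# Hypothetical statistics for one term:
# shard id -> (documents on the shard, documents containing the term).
SHARD_STATS = {0: (1000, 10), 1: (1000, 200)}


def idf(total_docs, docs_with_term):
    """A simplified inverse document frequency (not Lucene's exact formula)."""
    return math.log(total_docs / docs_with_term)


# Without the extra phase, each shard scores with its own local statistics.
local_idf = {shard: idf(n, df) for shard, (n, df) in SHARD_STATS.items()}

# With distributed frequencies, statistics are summed first, so every shard
# uses the same value when scoring.
total_docs = sum(n for n, _df in SHARD_STATS.values())
docs_with_term = sum(df for _n, df in SHARD_STATS.values())
global_idf = idf(total_docs, docs_with_term)
```

Here the term looks rare on shard 0 and common on shard 1, so the same term would weigh differently depending on where a document lives. Using the summed statistics removes that difference at the price of an additional round trip.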

There are also two deprecated search types: count and scan. The first one is deprecated starting from Elasticsearch 2.0 and the second one starting with Elasticsearch 2.1. The first search type used to provide benefits where only aggregations or the number of documents was relevant, but now it is enough to add size equal to 0 to your queries. The scan request was used for scrolling functionality.

So, if we would like to set the search type explicitly, for example to the default query_then_fetch, we would run the following command:

curl -XGET 'localhost:9200/library/book/_search?pretty&search_type=query_then_fetch' -d '{
 "query" : {
  "term" : { "title" : "crime" }
 }
}'

Search execution preference

In addition to the possibility of controlling how the query is executed, we can also control on which shards to execute the query. By default, Elasticsearch uses both primary shards and replicas, the ones available on the node to which we’ve sent the request as well as the ones on the other nodes in the cluster. In most cases, the default behavior is the right choice. But there may be times when we want to change it. For example, you may want the search to be only executed on the primary shards. To do that, we can set the preference request parameter to one of the following values:

  • _primary: The operation will be only executed on the primary shards, so the replicas won’t be used. This can be useful when we need to use the latest information from the index but our data is not replicated right away.
  • _primary_first: The operation will be executed on the primary shards if they are available. If not, it will be executed on the other shards.
  • _replica: The operation will be executed only on the replica shards.
  • _replica_first: This operation is similar to _primary_first, but uses replica shards. The operation will be executed on the replica shards if possible, and on the primary shards if the replicas are not available.
  • _local: The operation will be executed on the shards available on the node to which the request was sent, and if such shards are not present, the request will be forwarded to the appropriate nodes.
  • _only_node:node_id: This operation will be executed on the node with the provided node identifier.
  • _only_nodes:nodes_spec: This operation will be executed on the nodes that are defined in nodes_spec. This can be an IP address, a name, a name or IP address using wildcards, and so on. For example, if nodes_spec is set to 192.168.1.*, the operation will be run on the nodes with IP address starting with 192.168.1.
  • _prefer_node:node_id: Elasticsearch will try to execute the operation on the node with the provided identifier. However, if the node is not available, it will be executed on the nodes that are available.
  • _shards:1,2: Elasticsearch will execute the operation on the shards with the given identifiers; in this case, on shards with identifiers 1 and 2. The _shards parameter can be combined with other preferences, but the shards identifiers need to be provided first. For example, _shards:1,2;_local.
  • Custom value: Any custom string value may be passed. Requests with the same value provided will be executed on the same shards.

For example, if we would like to execute a query only on the local shards, we would run the following command:

curl -XGET 'localhost:9200/library/_search?pretty&preference=_local' -d '{
 "query" : {
  "term" : { "title" : "crime" }
 }
}'
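When requests are generated by an application, the preference value usually ends up as one more URL parameter. The following Python sketch builds such URLs with the standard library; the search_url helper is hypothetical, and the custom value user-42 is only an example of an application-chosen string, such as a session or user identifier.

```python
from urllib.parse import urlencode


def search_url(index, preference=None, search_type=None,
               host="localhost:9200"):
    """Build a _search URL with optional preference and search_type values."""
    params = {}
    if preference is not None:
        params["preference"] = preference
    if search_type is not None:
        params["search_type"] = search_type
    query = urlencode(params)  # also percent-encodes characters such as ':'
    return "http://%s/%s/_search%s" % (host, index,
                                       "?" + query if query else "")


print(search_url("library", preference="_local"))
print(search_url("library", preference="user-42"))
```

Sending the same custom value with every request of a given user is one way to take advantage of the rule that requests with the same value are executed on the same shards.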

Search shards API

When discussing the search preference, we would also like to mention the search shards API exposed by Elasticsearch. This API allows us to check which shards the query will be executed on. In order to use this API, run a request against the _search_shards REST endpoint. For example, to see how the query will be executed, we run the following command:

curl -XGET 'localhost:9200/library/_search_shards?pretty' -d '{"query":{"match_all":{}}}'

The response to the preceding command will be as follows:

{
  "nodes" : {
    "my0DcA_MTImm4NE3cG3ZIg" : {
      "name" : "Cloud 9",
      "transport_address" : "127.0.0.1:9300",
      "attributes" : { }
    }
  },
  "shards" : [ [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "my0DcA_MTImm4NE3cG3ZIg",
    "relocating_node" : null,
    "shard" : 0,
    "index" : "library",
    "version" : 4,
    "allocation_id" : {
      "id" : "9ayLDbL1RVSyJRYIJkuAxg"
    }
  } ], [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "my0DcA_MTImm4NE3cG3ZIg",
    "relocating_node" : null,
    "shard" : 1,
    "index" : "library",
    "version" : 4,
    "allocation_id" : {
      "id" : "wfpvtaLER-KVyOsuD46Yqg"
    }
  } ], [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "my0DcA_MTImm4NE3cG3ZIg",
    "relocating_node" : null,
    "shard" : 2,
    "index" : "library",
    "version" : 4,
    "allocation_id" : {
      "id" : "zrLPWhCOSTmjlb8TY5rYQA"
    }
  } ], [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "my0DcA_MTImm4NE3cG3ZIg",
    "relocating_node" : null,
    "shard" : 3,
    "index" : "library",
    "version" : 4,
    "allocation_id" : {
      "id" : "efnvY7YcSz6X8X8USacA7g"
    }
  } ], [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "my0DcA_MTImm4NE3cG3ZIg",
    "relocating_node" : null,
    "shard" : 4,
    "index" : "library",
    "version" : 4,
    "allocation_id" : {
      "id" : "XJHW2J63QUKdh3bK3T2nzA"
    }
  } ] ]
}

As you can see, in the response returned by Elasticsearch, we have the information about the shards that will be used during the query process. Of course, with the search shards API, we can use additional parameters that control the querying process. These parameters are routing, preference, and local. We are already familiar with the first two. The local parameter is a Boolean parameter (true or false) that allows us to tell Elasticsearch to use the cluster state information stored on the local node (setting local to true) instead of the one from the master node (setting local to false). This allows us to diagnose problems with cluster state synchronization.
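A response like the preceding one is easier to read once it is condensed. The following Python sketch maps every shard number to the node names that hold a copy of it; the shard_placement function is a hypothetical helper, and the sample dictionary is a shortened version of the response shown above.

```python
def shard_placement(response):
    """Map each shard number to (node name, role) for every listed copy."""
    names = {node_id: info["name"]
             for node_id, info in response["nodes"].items()}
    placement = {}
    # "shards" is a list of groups; each group lists the copies of one shard.
    for copies in response["shards"]:
        for copy in copies:
            role = "primary" if copy["primary"] else "replica"
            placement.setdefault(copy["shard"], []).append(
                (names[copy["node"]], role))
    return placement


sample = {
    "nodes": {"my0DcA_MTImm4NE3cG3ZIg": {"name": "Cloud 9"}},
    "shards": [
        [{"shard": 0, "primary": True, "node": "my0DcA_MTImm4NE3cG3ZIg"}],
        [{"shard": 1, "primary": True, "node": "my0DcA_MTImm4NE3cG3ZIg"}],
    ],
}
print(shard_placement(sample))
```

Comparing the output for different preference values is a quick way to check which shard copies a given setting selects.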

Summary

This article has been all about querying Elasticsearch. We started by looking at how to query Elasticsearch and what Elasticsearch does when it needs to handle the query. We also learned about the basic and compound queries, so we are now able to use both simple queries as well as the ones that group multiple small queries together. Finally, we discussed how to choose the right query for a given use case.
