
Suggesters for Improving User Search Experience


In this article by Bharvi Dixit, the author of the book Mastering ElasticSearch 5.0 – Third Edition, we will focus on improving the user search experience using suggesters, which allow you to correct spelling mistakes in user queries and build efficient autocomplete mechanisms. First, let's look at the query possibilities and the responses returned by Elasticsearch. We will show the general principles first, and then get into the details of each of the available suggesters.


Using the suggester under search

Before Elasticsearch 5.0, it was possible to get suggestions for a given text by using a dedicated _suggest REST endpoint. In Elasticsearch 5.0, this dedicated _suggest endpoint has been deprecated in favor of the suggest API available under the _search endpoint; in this release, suggest-only search requests have also been optimized for performance. Similar to the query object, we use a suggest object, inside which we need to provide the text to analyze and the type of suggester to use (term or phrase). So, if we would like to get suggestions for the phrase chrimes in wordl (note that we've misspelled the words on purpose), we would run the following query:

curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d'
{
  "suggest": {
    "first_suggestion": {
      "text": "chrimes in wordl",
      "term": {
        "field": "title"
      }
    }
  }
}'

The dedicated _suggest endpoint has been deprecated in Elasticsearch version 5.0 and might be removed in future releases, so you are advised to send suggest requests under the _search endpoint. All the examples covered in this article use the _search endpoint for suggest requests.

As you can see, the suggestion request is wrapped inside the suggest object and is sent to Elasticsearch in its own object with the name we chose (in the preceding case, it is first_suggestion). Next, we specify the text for which we want suggestions to be returned, using the text parameter. Finally, we add the suggester object, which is either term or phrase. The suggester object contains its configuration, which for the term suggester used in the preceding command is the field we want to use for suggestions (the field property).

We can also send more than one suggestion at a time by adding multiple suggestion names. For example, if, in addition to the preceding suggestion, we also wanted a suggestion for the word arest, we would use the following command:

curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d'
{
  "suggest": {
    "first_suggestion": {
      "text": "chrimes in wordl",
      "term": {
        "field": "title"
      }
    },
    "second_suggestion": {
      "text": "arest",
      "term": {
        "field": "text"
      }
    }
  }
}'

Understanding the suggester response

Let’s now look at the example response for the suggestion query we have executed. Although the response will differ for each suggester type, let’s look at the response returned by Elasticsearch for the first command we’ve sent in the preceding code that used the term suggester:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "first_suggestion" : [
      {
        "text" : "chrimes",
        "offset" : 0,
        "length" : 7,
        "options" : [
          {
            "text" : "crimes",
            "score" : 0.8333333,
            "freq" : 36
          },
          {
            "text" : "choices",
            "score" : 0.71428573,
            "freq" : 2
          },
          {
            "text" : "chrome",
            "score" : 0.6666666,
            "freq" : 2
          },
          {
            "text" : "chimps",
            "score" : 0.6666666,
            "freq" : 1
          },
          {
            "text" : "crimea",
            "score" : 0.6666666,
            "freq" : 1
          }
        ]
      },
      {
        "text" : "in",
        "offset" : 8,
        "length" : 2,
        "options" : [ ]
      },
      {
        "text" : "wordl",
        "offset" : 11,
        "length" : 5,
        "options" : [
          {
            "text" : "world",
            "score" : 0.8,
            "freq" : 436
          },
          {
            "text" : "words",
            "score" : 0.8,
            "freq" : 6
          },
          {
            "text" : "word",
            "score" : 0.75,
            "freq" : 9
          },
          {
            "text" : "worth",
            "score" : 0.6,
            "freq" : 21
          },
          {
            "text" : "worst",
            "score" : 0.6,
            "freq" : 16
          }
        ]
      }
    ]
  }
}

As you can see in the preceding response, the term suggester returns a list of possible suggestions for each term that was present in the text parameter of our first_suggestion section. For each term, the term suggester will return an array of possible suggestions with additional information. Looking at the data returned for the wordl term, we can see the original word (the text parameter), its offset in the original text parameter (the offset parameter), and its length (the length parameter).

The options array contains suggestions for the given word and will be empty if Elasticsearch doesn’t find any suggestions. Each entry in this array is a suggestion and is characterized by the following properties:

  • text: This is the text of the suggestion.
  • score: This is the suggestion score; the higher the score, the better the suggestion will be.
  • freq: This is the frequency of the suggestion. The frequency represents how many times the word appears in documents in the index we are running the suggestion query against. The higher the frequency, the more documents contain the suggested word in their fields and the higher the chance that the suggestion is the one we are looking for.

Please remember that the phrase suggester response will differ from the one returned by the term suggester.

The term suggester

The term suggester works on the basis of the edit distance, which means that the suggestion requiring the fewest character changes or removals to transform it into the original word is considered the best one. For example, let's take the words worl and work. In order to change the worl term to work, we need to change the letter l to k, which means a distance of one. Of course, the text provided to the suggester is analyzed first, and then terms are chosen to be suggested.
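To build intuition for the metric described above, here is a minimal bash sketch of the classic Levenshtein edit distance, computed with a dynamic-programming table. It is an illustration only, not the implementation Elasticsearch uses internally:

```shell
# Levenshtein distance: minimum number of single-character
# insertions, deletions, or substitutions between two words.
lev() {
  local a=$1 b=$2 i j cost
  local -a prev cur
  # Row 0: distance from the empty string is just the length.
  for ((j = 0; j <= ${#b}; j++)); do prev[j]=$j; done
  for ((i = 1; i <= ${#a}; i++)); do
    cur[0]=$i
    for ((j = 1; j <= ${#b}; j++)); do
      if [ "${a:i-1:1}" = "${b:j-1:1}" ]; then cost=0; else cost=1; fi
      local del=$((prev[j] + 1)) ins=$((cur[j-1] + 1)) sub=$((prev[j-1] + cost))
      local best=$del
      if ((ins < best)); then best=$ins; fi
      if ((sub < best)); then best=$sub; fi
      cur[j]=$best
    done
    for ((j = 0; j <= ${#b}; j++)); do prev[j]=${cur[j]}; done
  done
  echo "${prev[${#b}]}"
}

lev worl work   # prints 1: change the final "l" to "k"
```

Running `lev chrimes crimes` likewise prints 1 (a single deleted h), which is why crimes tops the options list in the response above.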

The phrase suggester

The term suggester provides a great way to correct user spelling mistakes on a per-term basis. However, if we would like to get back whole phrases, that is not possible with this suggester. This is why the phrase suggester was introduced. It is built on top of the term suggester and adds phrase calculation logic to it, so that whole phrases can be returned instead of individual terms. It uses N-gram-based language models to calculate how good a suggestion is, and it will usually be a better choice than the term suggester for suggesting whole phrases. The N-gram approach divides terms in the index into grams: word fragments built of one or more letters. For example, if we would like to divide the word mastering into bi-grams (two-letter N-grams), it would look like this: ma as st te er ri in ng.
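A phrase suggester request has the same shape as the term suggester examples shown earlier, with phrase in place of term. The following sketch runs against the same wikinews index used above and assumes a running Elasticsearch 5.x node on localhost:9200:

```shell
curl -XPOST "http://localhost:9200/wikinews/_search?pretty" -d'
{
  "suggest": {
    "phrase_suggestion": {
      "text": "chrimes in wordl",
      "phrase": {
        "field": "title"
      }
    }
  }
}'
```

Instead of per-term options, the response contains whole corrected phrases, such as crimes in world, each with its own score.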

The completion suggester

So far, we have read about the term and phrase suggesters, which provide suggestions for misspelled text. The completion suggester is completely different: it is a prefix-based suggester that allows us to build autocomplete (search-as-you-type) functionality in a very performance-effective way, because it stores precomputed structures in the index instead of calculating them at query time. This suggester is not about correcting user spelling mistakes.

In Elasticsearch 5.0, the completion suggester has gone through a complete rewrite. Both the syntax and the data structures of the completion field type have changed, as has the response structure, and many exciting new features and speed optimizations have been introduced. One of these features makes the completion suggester near real time, which means deleted suggestions are omitted from suggestion results as soon as they are deleted.

The logic behind the completion suggester

The prefix suggester is based on the data structure called a Finite State Transducer (FST) (for more information, refer to http://en.wikipedia.org/wiki/Finite_state_transducer). Although it is highly efficient, building it may require significant resources on systems holding large amounts of data: exactly the kind of systems Elasticsearch is perfectly suitable for. If we had to rebuild such a structure on the nodes after each restart or cluster state change, we would lose performance. Because of this, the Elasticsearch creators decided to build an FST-like structure at index time and store it in the index, so that it can be loaded into memory when needed.

Using the completion suggester

To use a prefix-based suggester, we need to properly index our data with a dedicated field type called completion, which stores the FST-like structure in the index. In order to illustrate how to use this suggester, let's assume that we want to create an autocomplete feature to show book authors, whom we store in an additional index. In addition to the authors' names, we want to return the identifiers of the books they wrote, in order to search for them with an additional query. We start by creating the authors index with the following command:

curl -XPUT "http://localhost:9200/authors" -d'
{
  "mappings": {
    "author": {
      "properties": {
        "name": {
          "type": "keyword"
        },
        "suggest": {
          "type": "completion"
        }
      }
    }
  }
}'

Our index will contain a single type called author. Each document will have two fields: the name field, which is the name of the author, and the suggest field, which is the field we will use for autocomplete. The suggest field is the one we are interested in; we’ve defined it using the completion type, which will result in storing the FST-like structure in the index.
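With the mapping in place, we can index an author and ask for completions. The author name and suggestion inputs below are purely illustrative values, and the commands assume an Elasticsearch 5.x node running on localhost:9200:

```shell
# Index a sample author; the suggest field takes the inputs
# that the completion suggester will match prefixes against.
curl -XPUT "http://localhost:9200/authors/author/1" -d'
{
  "name": "Fyodor Dostoevsky",
  "suggest": {
    "input": [ "fyodor", "dostoevsky" ]
  }
}'

# Ask for completions of the prefix "fyo" using the completion suggester.
curl -XPOST "http://localhost:9200/authors/_search?pretty" -d'
{
  "suggest": {
    "authors_suggestion": {
      "prefix": "fyo",
      "completion": {
        "field": "suggest"
      }
    }
  }
}'
```

In Elasticsearch 5.x, the returned options include each matching document's _source, which is how additional data such as book identifiers can be retrieved once they are added to the documents.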

Implementing your own autocompletion

The completion suggester was designed to be a powerful and easily implemented solution for autocomplete, but it supports only prefix queries. Most of the time, a prefix query is all autocomplete needs: if I type elastic, I expect elasticsearch as a suggestion, not nonelastic. However, in some use cases one wants more general partial-word completion, and the completion suggester cannot fulfill this requirement. Its second limitation is that it does not allow advanced queries and filters.

To overcome both of these limitations, we are going to implement a custom autocomplete feature based on N-grams, which works in almost all scenarios.

Creating index

Let's create an index named location-suggestion with the following settings and mappings:

curl -XPUT "http://localhost:9200/location-suggestion" -d'
{
   "settings": {
      "index": {
         "analysis": {
            "filter": {
               "nGram_filter": {
                  "token_chars": [
                     "letter",
                     "digit",
                     "punctuation",
                     "symbol",
                     "whitespace"
                  ],
                  "min_gram": "2",
                  "type": "nGram",
                  "max_gram": "20"
               }
            },
            "analyzer": {
               "nGram_analyzer": {
                  "filter": [
                     "lowercase",
                     "asciifolding",
                     "nGram_filter"
                  ],
                  "type": "custom",
                  "tokenizer": "whitespace"
               },
               "whitespace_analyzer": {
                  "filter": [
                     "lowercase",
                     "asciifolding"
                  ],
                  "type": "custom",
                  "tokenizer": "whitespace"
               }
            }
         }
      }
   },
   "mappings": {
      "locations": {
         "properties": {
            "name": {
               "type": "text",
               "analyzer": "nGram_analyzer",
               "search_analyzer": "whitespace_analyzer"
            },
            "country": {
               "type": "keyword"
            }
         }
      }
   }
}'

Understanding the parameters

If you look carefully at the preceding curl request for creating the index, you'll see that it contains both settings and mappings. Let's look at them in detail, one by one.

Configuring settings

Our settings contain two custom analyzers: nGram_analyzer and whitespace_analyzer. We have created the custom whitespace_analyzer using the whitespace tokenizer just to make sure that all tokens are indexed in lowercase and ASCII-folded form. Our main interest is nGram_analyzer, which contains a custom filter, nGram_filter, with the following parameters:

  • type: Specifies the type of the token filter, which is nGram in our case.
  • token_chars: Specifies which kinds of characters are allowed in the generated tokens. Punctuation and special characters are generally removed from token streams, but in our example we intend to keep them. We have also kept whitespace, so that a text containing United States still appears in the suggestions when a user searches for u s.
  • min_gram and max_gram: These two attributes set the minimum and maximum length of the substrings that will be generated and added to the lookup table. For example, according to our index settings, the token India will generate the following tokens:
    [ "di", "dia", "ia", "in", "ind", "indi", "india", "nd", "ndi",
      "ndia" ]
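The gram generation above can be reproduced with a short shell loop. This is only a sketch of what the nGram filter does to a single lowercased token, not Elasticsearch's actual implementation:

```shell
# Emit every substring of length 2..20 of a token, sorted,
# mimicking the grams the nGram filter produces for "india".
w="india"
n=${#w}
for ((i = 0; i < n; i++)); do
  for ((len = 2; len <= 20 && i + len <= n; len++)); do
    echo "${w:i:len}"
  done
done | sort
# prints: di dia ia in ind indi india nd ndi ndia (one per line)
```

Each of these grams becomes a separately searchable term, which is what makes partial-word matching possible.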

Configuring mappings

The document type of our index is locations and it has two fields, name and country. The most important thing to note is the way the analyzers have been defined for the name field, which will be used for autosuggestion. For this field, we have set the index analyzer to our custom nGram_analyzer, while the search analyzer is set to whitespace_analyzer.

The index_analyzer parameter is no longer supported from Elasticsearch version 5.0 onward. Also, if you want to configure the search_analyzer property for a field, then you must configure the analyzer property too, the way we have shown in the preceding example.
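To see the custom autocomplete in action, we can index a sample location and query the name field with an ordinary match query. The document and query values below are illustrative, and a running Elasticsearch 5.x node on localhost:9200 is assumed:

```shell
# Index a sample location document.
curl -XPUT "http://localhost:9200/location-suggestion/locations/1" -d'
{
  "name": "United States",
  "country": "United States of America"
}'

# Partial-word input: the grams "uni" and "sta" were indexed by
# nGram_analyzer, while the query text is analyzed by whitespace_analyzer.
curl -XPOST "http://localhost:9200/location-suggestion/_search?pretty" -d'
{
  "query": {
    "match": {
      "name": {
        "query": "uni sta",
        "operator": "and"
      }
    }
  }
}'
```

Because this is a regular query rather than a suggester, it can be freely combined with filters and other query clauses, which addresses the second limitation of the completion suggester discussed earlier.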

Summary

In this article, we focused on improving the user search experience. We started with the term and phrase suggesters and then covered the search-as-you-type (autocompletion) feature, which is implemented using the completion suggester. We also saw the limitations of the completion suggester in handling advanced queries and partial matching, which we then solved by implementing our own custom completion based on N-grams.
