
The Scripting Capabilities of Elasticsearch

  • 19 min read
  • 08 Jan 2016


In this article by Rafał Kuć and Marek Rogozinski, authors of the book Elasticsearch Server - Third Edition, we will look at the scripting capabilities of Elasticsearch. Elasticsearch has a few functionalities in which scripts can be used. Even though scripting may seem to be a rather advanced topic, we will explore the possibilities it offers, because scripts are priceless in certain situations.

Elasticsearch can use several languages for scripting. When not explicitly declared, it assumes that Groovy (http://www.groovy-lang.org/) is used. Other languages available out of the box are the Lucene expression language and Mustache (https://mustache.github.io/). Of course, we can use plugins that will make Elasticsearch understand additional scripting languages such as JavaScript, Mvel, or Python. One thing worth mentioning is this: independently of the scripting language that we choose, Elasticsearch exposes objects that we can use in our scripts. Let's start by briefly looking at what type of information we are allowed to use in our scripts.


Objects available during script execution

During different operations, Elasticsearch allows us to use different objects in our scripts. To develop a script that fits our use case, we should be familiar with those objects.

For example, during a search operation, the following objects are available:

  • _doc (also available as doc): An instance of the org.elasticsearch.search.lookup.LeafDocLookup object. It gives us access to the current document found with the calculated score and field values.
  • _source: An instance of the org.elasticsearch.search.lookup.SourceLookup object. It provides access to the source of the current document and the values defined in the source.
  • _fields: An instance of the org.elasticsearch.search.lookup.LeafFieldsLookup object. It can be used to access the values of the document fields.

On the other hand, during a document update operation, the variables mentioned above are not accessible. Elasticsearch exposes only the ctx object with the _source property, which provides access to the document currently processed in the update request.
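To illustrate, an update request using the ctx object could look like the following sketch (the copies field and the document identifier are hypothetical, and this assumes that inline scripting is enabled):

```json
{
  "script" : {
    "inline" : "ctx._source.copies += count",
    "params" : {
      "count" : 1
    }
  }
}
```

Such a body, sent to the /library/book/4/_update endpoint, would increment the copies field of the document by the value of the count parameter.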

As we have previously seen, several methods are mentioned in the context of document fields and their values. Let's now look at examples of how to get the value of a particular field using the previously mentioned objects available during search operations. In brackets, you can see what Elasticsearch will return for one of our example documents from the library index (we will use the document with identifier 4):

  • _doc.title.value (and)
  • _source.title (crime and punishment)
  • _fields.title.value (null)

A bit confusing, isn't it? During indexing, the original document is, by default, stored in the _source field. Of course, by default, all fields are present in that _source field. In addition to this, the document is parsed, and every field may be stored in an index if it is marked as stored (that is, if the store property is set to true; otherwise, by default, the fields are not stored). Finally, the field value may be configured as indexed. This means that the field value is analyzed and placed in the index. To sum up, one field may land in an Elasticsearch index in the following ways:

  • As part of the _source document
  • As a stored and unparsed original value
  • As an indexed value that is processed by an analyzer

In scripts, we have access to all of these field representations. The only exception is the update operation, which, as we've mentioned before, gives us access only to the _source document as part of the ctx variable. You may wonder which version you should use. Well, if we want access to the processed form, the answer is simple: use the _doc object. What about _source and _fields? In most cases, _source is a good choice. It is usually fast and needs fewer disk operations than reading the original field values from the index. This is especially true when you need to read the values of multiple fields in your scripts: fetching a single _source field is faster than fetching multiple independent fields from the index.
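As a side note, the null returned by _fields.title.value comes from the title field not being stored. A hypothetical mapping fragment (the index and type names are assumptions for illustration) that would make the field available through _fields could look like this:

```json
{
  "mappings" : {
    "book" : {
      "properties" : {
        "title" : {
          "type" : "string",
          "store" : true
        }
      }
    }
  }
}
```

With such a mapping, _fields.title.value would return the stored, original value instead of null.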

Script types

Elasticsearch allows us to use scripts in three different ways:

  • Inline scripts: The source of the script is directly defined in the query
  • In-file scripts: The source is defined in an external file placed in the Elasticsearch config/scripts directory
  • As a document in the dedicated index: The source of the script is defined as a document in a special index available by using the /_scripts API endpoint

The choice of how to define scripts depends on several factors. If you have scripts that you will use in many different queries, in-file scripts or the dedicated index seem to be the best solutions. In-file scripts are probably less convenient, but they are preferred from the security point of view: they can't be overwritten and injected into your query, which could cause a security breach.

In-file scripts

This is the only way that is turned on by default in Elasticsearch. The idea is that every script used by the queries is defined in its own file placed in the config/scripts directory. We will now look at this method of using scripts. Let's create an example file called tag_sort.groovy and place it in the config/scripts directory of our Elasticsearch instance (or instances if we are running a cluster). The content of the mentioned file should look like this:

_doc.tags.values.size() > 0 ? _doc.tags.values[0] : 'u19999'

After a few seconds, Elasticsearch should automatically load a new file. You should see something like this in the Elasticsearch logs:

[2015-08-30 13:14:33,005][INFO ][script                   ] [Alex Wilder] compiling script file [/Users/negativ/Developer/ES/es-current/config/scripts/tag_sort.groovy]

If you have a multinode cluster, you have to make sure that the script is available on every node.

Now we are ready to use this script in our queries. A modified query that uses our script stored in the file looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{

  "query" : {

    "match_all" : { }

  },

  "sort" : {

    "_script" : {

      "script" : {

        "file" : "tag_sort"

       },

       "type" : "string",

       "order" : "asc"

     }

  }

}'

Now, let's look at the next possible way of defining a script: inline.

Inline scripts

Inline scripts are a more convenient way of using scripts, especially for constantly changing or ad hoc queries. The main drawback of such an approach is security. If we allow inline scripts, users can run any kind of query, including any kind of script, and such a script can be used by an attacker to execute arbitrary code on the server running Elasticsearch, with rights equal to those given to the user running Elasticsearch. In the worst-case scenario, an attacker could use security holes to gain superuser rights. This is why inline scripts are disabled by default. After careful consideration, you can enable them by adding this to the elasticsearch.yml file:

script.inline: on

After allowing the inline script to be executed, we can run a query that looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{

  "query" : {

    "match_all" : { }

  },

  "sort" : {

    "_script" : {

      "script" : {

        "inline" : "_doc.tags.values.size() > 0 ? _doc.tags.values[0] : 'u19999'"

       },

       "type" : "string",

       "order" : "asc"

     }

  }

}'

Indexed scripts

The last option for defining scripts is to store them in the dedicated Elasticsearch index. For the same security reasons, dynamic execution of indexed scripts is disabled by default. To enable indexed scripts, we have to add a configuration option similar to the one that we added to be able to use inline scripts. We need to add the following line to the elasticsearch.yml file:

script.indexed: on

After adding the above property to all the nodes and restarting the cluster, we will be ready to start using indexed scripts. Elasticsearch provides additional dedicated endpoints for this purpose. Let's store our script:

curl -XPOST 'localhost:9200/_scripts/groovy/tag_sort' -d '{

  "script" : "_doc.tags.values.size() > 0 ? _doc.tags.values[0] : 'u19999'"

}'

The script is ready, but let's discuss what we just did. We sent an HTTP POST request to the special _scripts REST endpoint, specifying the language of the script (groovy in our case) and its name (tag_sort) in the URL. The body of the request is the script itself.

We can now move on to the query, which looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{

  "query" : {

    "match_all" : { }

  },

  "sort" : {

    "_script" : {

      "script" : {

        "id" : "tag_sort"

       },

       "type" : "string",

       "order" : "asc"

     }

  }

}'

As we can see, this query is practically identical to the query used with the script defined in a file. The only difference is the id parameter instead of file.

Querying with scripts

If we look at any request made to Elasticsearch that uses scripts, we will notice some similar properties, which are as follows:

  • script: The property that wraps the script definition.
  • inline: The property holding the code of the script itself.
  • id: The property that defines the identifier of the indexed script.
  • file: The filename (without extension) with the script definition, used when an in-file script is referenced.
  • lang: The property defining the script language. If it is omitted, Elasticsearch assumes groovy.
  • params: An object containing parameters and their values. Every defined parameter can be used inside the script by specifying that parameter's name. Parameters allow us to write cleaner code that will be executed in a more efficient manner: scripts that use parameters are executed faster than code with embedded constants, because of caching.
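Putting these properties together, a script definition inside a query could look like the following sketch (the script body and parameter name are illustrative, not taken from our example index):

```json
"script" : {
  "inline" : "_doc.tags.values.size() * factor",
  "lang" : "groovy",
  "params" : {
    "factor" : 2
  }
}
```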

Scripting with parameters

As our scripts become more and more complicated, the need for creating multiple, almost identical scripts can appear. Those scripts usually differ in the values used, with the logic behind them being exactly the same. In our simple example, we used a hardcoded value to mark documents with an empty tags list. Let's change this so that the value can be provided in the query. Let's use the in-file script definition and create the tag_sort_with_param.groovy file with the following contents:

_doc.tags.values.size() > 0 ? _doc.tags.values[0] : tvalue

The only change we've made is the introduction of a parameter named tvalue, which can be set in the query in the following way:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{

  "query" : {

    "match_all" : { }

  },

  "sort" : {

    "_script" : {

      "script" : {

        "file" : "tag_sort_with_param",

        "params" : {

          "tvalue" : "000"

        }

       },

       "type" : "string",

       "order" : "asc"

     }

  }

}'

The params section defines all the script parameters. In our simple example, we've only used a single parameter, but of course, we can have multiple parameters in a single query.

Script languages

The default language for scripting is Groovy. However, you are not limited to a single scripting language when using Elasticsearch. In fact, if you would like to, you can even use Java to write your scripts. In addition to that, the community behind Elasticsearch provides support for more languages as plugins. So, if you are willing to install plugins, you can extend the list of scripting languages that Elasticsearch supports even further. You may wonder why you should even consider using a scripting language other than the default Groovy. The first reason is your own preferences. If you are a Python enthusiast, you are probably already thinking about how to use Python for your Elasticsearch scripts. The other reason could be security. When we talked about inline scripts, we told you that they are turned off by default. This is not exactly true for all the scripting languages available out of the box. Inline scripts are disabled by default when using Groovy, but you can use Lucene expressions and Mustache without any issues. This is because those languages are sandboxed, which means that security-sensitive functions are turned off. And of course, the last factor when choosing the language is performance. Theoretically, native scripts (in Java) should have better performance than others, but you should remember that the difference can be insignificant. You should always weigh the cost of development and measure the performance.
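For example, because Lucene expressions are sandboxed, an inline expression script can be used for sorting without changing the default security settings. The following request body is a sketch that assumes our documents contain a numeric year field (Lucene expressions work only on numeric fields):

```json
{
  "query" : {
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : {
        "inline" : "doc['year'].value",
        "lang" : "expression"
      },
      "type" : "number",
      "order" : "desc"
    }
  }
}
```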

Using something other than embedded languages

Using Groovy for scripting is a simple and sufficient solution for most use cases. However, you may prefer to use something different, such as JavaScript, Python, or Mvel.

For now, we'll just run the following command from the Elasticsearch directory:

bin/plugin install elasticsearch/elasticsearch-lang-javascript/2.7.0

The preceding command will install a plugin that will allow the use of JavaScript as the scripting language. The only change we should make in the request is putting in additional information about the language we are using for scripting. And of course, we have to modify the script itself to correctly use the new language. Look at the following example:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{

  "query" : {

    "match_all" : { }

  },

  "sort" : {

    "_script" : {

      "script" : {

        "inline" : "_doc.tags.values.length > 0 ? _doc.tags.values[0] : 'u19999';",

        "lang" : "javascript"

      },

      "type" : "string",

      "order" : "asc"

    }

  }

}'

As you can see, we've used JavaScript for scripting instead of the default Groovy. The lang parameter informs Elasticsearch about the language being used.

Using native code

If the scripts are too slow or if you don't like scripting languages, Elasticsearch allows you to write Java classes and use them instead of scripts. There are two possible ways of adding native scripts: adding classes that define scripts to the Elasticsearch classpath, or adding a script as a functionality provided by a plugin. We will describe the second solution, as it is more elegant.

The factory implementation

We need to implement at least two classes to create a new native script. The first one is a factory for our script. For now, let's focus on it. The following sample code illustrates the factory for our script:

package pl.solr.elasticsearch.examples.scripts;

import java.util.Map;

import org.elasticsearch.common.Nullable;

import org.elasticsearch.script.ExecutableScript;

import org.elasticsearch.script.NativeScriptFactory;

public class HashCodeSortNativeScriptFactory implements NativeScriptFactory {

    @Override

    public ExecutableScript newScript(@Nullable Map<String, Object> params) {

        return new HashCodeSortScript(params);

    }

  @Override

  public boolean needsScores() {

    return false;

  }

}

This class implements the org.elasticsearch.script.NativeScriptFactory interface, which forces us to implement two methods. The newScript() method takes the parameters defined in the API call and returns an instance of our script. Finally, needsScores() informs Elasticsearch whether our script needs document scores to be calculated.

Implementing the native script

Now let's look at the implementation of our script. The idea is simple—our script will be used for sorting. The documents will be ordered by the hashCode() value of the chosen field. Documents without a value in the defined field will be first on the results list. We know that the logic doesn't make much sense, but it is good for presentation as it is simple. The source code for our native script looks as follows:

package pl.solr.elasticsearch.examples.scripts;

import java.util.Map;

import org.elasticsearch.script.AbstractSearchScript;

public class HashCodeSortScript extends AbstractSearchScript {

  private String field = "name";

  public HashCodeSortScript(Map<String, Object> params) {

    if (params != null && params.containsKey("field")) {

      this.field = params.get("field").toString();

    }

  }

  @Override

  public Object run() {

    Object value = source().get(field);

    if (value != null) {

      return value.hashCode();

    }

    return 0;

  }

}

First of all, our class inherits from the org.elasticsearch.script.AbstractSearchScript class and implements the run() method. This is where we get the appropriate values from the current document, process them according to our strange logic, and return the result. You may notice the source() call. Yes, it is exactly the same _source object that we met in the non-native scripts. The doc() and fields() methods are also available, and they follow the same logic that we described earlier.

The thing worth looking at is how we've used the parameters. We assume that a user can put the field parameter, telling us which document field will be used for manipulation. We also provide a default value for this parameter.
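To get a feeling for the sort values our script produces, here is a minimal, standalone sketch of the same logic (it is not part of the plugin and does not require any Elasticsearch classes):

```java
// Standalone sketch of the values produced by HashCodeSortScript:
// the hashCode() of the field value, or 0 when the field is missing.
public class HashCodeSortDemo {

    static int sortValue(Object fieldValue) {
        return fieldValue != null ? fieldValue.hashCode() : 0;
    }

    public static void main(String[] args) {
        // A document without the field sorts with the value 0.
        System.out.println(sortValue(null));
        // String.hashCode() is deterministic, so the ordering is stable.
        System.out.println(sortValue("abc")); // prints 96354
    }
}
```

Because String.hashCode() is deterministic across JVMs, documents with the same field value always get the same sort value, and documents without the field sort together with the value 0, which is exactly the behavior we will observe when running the script.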

The plugin definition

We said that we will install our script as a part of a plugin. This is why we need additional files. The first file is the plugin initialization class, where we can tell Elasticsearch about our new script:

package pl.solr.elasticsearch.examples.scripts;

import org.elasticsearch.plugins.Plugin;

import org.elasticsearch.script.ScriptModule;

public class ScriptPlugin extends Plugin {

  @Override

  public String description() {

   return "The example of native sort script";

  }

  @Override

  public String name() {

    return "naive-sort-plugin";

  }

  public void onModule(final ScriptModule module) {

    module.registerScript("native_sort", 
      HashCodeSortNativeScriptFactory.class);

  }

}

The implementation is easy. The description() and name() methods are only for informational purposes, so let's focus on the onModule() method. In our case, we need access to the script module: the Elasticsearch service connected with scripts and scripting languages. This is why we define onModule() with one ScriptModule argument. Thanks to Elasticsearch magic, we can use this module and register our script so that it can be found by the engine. We have used the registerScript() method, which takes the script name and the previously defined factory class.

The second file needed is a plugin descriptor file: plugin-descriptor.properties. It defines the constants used by the Elasticsearch plugin subsystem. Let's look at the contents of this file:

jvm=true

classname=pl.solr.elasticsearch.examples.scripts.ScriptPlugin

elasticsearch.version=2.0.0-beta2-SNAPSHOT

version=0.0.1-SNAPSHOT

name=native_script

description=Example Native Scripts

java.version=1.7

The appropriate lines have the following meaning:

  • jvm: This tells Elasticsearch that our file contains Java code
  • classname: This describes the main class with the plugin definition
  • elasticsearch.version and java.version: These specify the Elasticsearch and Java versions required by our plugin
  • name and description: These are an informative name and a short description of our plugin

And that's it! We have all the files needed to fire our script. Note that now it is quite convenient to add new scripts and pack them as a single plugin.

Installing a plugin

Now it's time to install our native script embedded in the plugin. After packing the compiled classes as a JAR archive, we should put it into the Elasticsearch plugins/native-script directory. The native-script part is a root directory for our plugin and you may name it as you wish. In this directory, you also need the prepared plugin-descriptor.properties file. This makes our plugin visible to Elasticsearch.
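Assuming the compiled classes were packed into a file called native-script-plugin.jar (this name is hypothetical; any JAR name will do), the relevant part of the Elasticsearch directory tree would look like this:

```text
plugins/
  native-script/
    plugin-descriptor.properties
    native-script-plugin.jar
```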

Running the script

After restarting Elasticsearch (or the entire cluster if you are running more than a single node), we can start sending the queries that use our native script. For example, we will send a query that uses our previously indexed data from the library index. This example query looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{

  "query" : {

    "match_all" : { }

  },

  "sort" : {

    "_script" : {

      "script" : {

        "script" : "native_sort",

        "lang" : "native",

        "params" : {

          "field" : "otitle"

        }

      },

      "type" : "string",

      "order" : "asc"

    }

  }

}'

Note the params part of the query. In this call, we want to sort on the otitle field. We provide the script name as native_sort and the script language as native. This is required. If everything goes well, we should see our results sorted by our custom sort logic. If we look at the response from Elasticsearch, we will see that documents without the otitle field are at the first few positions of the results list and their sort value is 0.

Summary

In this article, we looked at the scripting capabilities of Elasticsearch. We saw which objects are available during script execution, the three ways of defining scripts (inline, in files, and as documents in the dedicated index), and how to use script parameters. We also discussed the available scripting languages and, finally, implemented, installed, and ran a native script written in Java.
