Computing statistics for the search results

Imagine that you want to compute some basic statistics about the documents in the results list. For example, in an e-commerce shop you might want to show the minimum and maximum price of the documents found for a given query. You could fetch all the documents and calculate the values yourself, but Solr can do it for you. This recipe shows how to use that functionality.

How to do it…

Let’s start with the index structure (just add this to the fields section of your schema.xml file):

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="price" type="float" indexed="true" stored="true" />


The example data file looks like this:

<add>
  <doc>
    <field name="id">1</field>
    <field name="name">Book 1</field>
    <field name="price">39.99</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="name">Book 2</field>
    <field name="price">30.11</field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="name">Book 3</field>
    <field name="price">27.77</field>
  </doc>
</add>


Let’s assume that we want our statistics to be computed for the price field. To do that, we send the following query to Solr:

http://localhost:8983/solr/select?q=name:book&stats=true&stats.field=price
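If you build this request from application code, it is safer to let a library handle the URL encoding. The following is a minimal sketch in Python that only assembles the URL and does not contact Solr. The host, port, and path are copied from the recipe; the constant and variable names are chosen here for illustration.

```python
from urllib.parse import urlencode

# Base URL as used in the recipe; adjust it for your own Solr instance.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# The same three parameters as in the query above.
params = {
    "q": "name:book",        # the main query
    "stats": "true",         # turn the StatsComponent on
    "stats.field": "price",  # the field to compute statistics for
}

# urlencode percent-encodes reserved characters such as the colon in the query.
stats_url = SOLR_SELECT_URL + "?" + urlencode(params)
print(stats_url)
```

The colon in name:book is percent-encoded in the resulting URL; Solr decodes it back to the same query.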


The response Solr returned should be like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="q">name:book</str>
      <str name="stats">true</str>
      <str name="stats.field">price</str>
    </lst>
  </lst>
  <result name="response" numFound="3" start="0">
    <doc>
      <str name="id">1</str>
      <str name="name">Book 1</str>
      <float name="price">39.99</float>
    </doc>
    <doc>
      <str name="id">2</str>
      <str name="name">Book 2</str>
      <float name="price">30.11</float>
    </doc>
    <doc>
      <str name="id">3</str>
      <str name="name">Book 3</str>
      <float name="price">27.77</float>
    </doc>
  </result>
  <lst name="stats">
    <lst name="stats_fields">
      <lst name="price">
        <double name="min">27.77</double>
        <double name="max">39.99</double>
        <double name="sum">97.86999999999999</double>
        <long name="count">3</long>
        <long name="missing">0</long>
        <double name="sumOfSquares">3276.9851000000003</double>
        <double name="mean">32.62333333333333</double>
        <double name="stddev">6.486118510583508</double>
      </lst>
    </lst>
  </lst>
</response>
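To use these numbers in an application, you need to pull them out of the stats section of the response. Here is one possible sketch using Python's standard XML parser. It works on a trimmed copy of the response above (only the stats section is kept), and the parse_stats helper is a hypothetical name introduced for this example, not part of Solr or any client library.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown above: only the stats section is kept.
RESPONSE_XML = """<response>
  <lst name="stats">
    <lst name="stats_fields">
      <lst name="price">
        <double name="min">27.77</double>
        <double name="max">39.99</double>
        <double name="sum">97.86999999999999</double>
        <long name="count">3</long>
        <long name="missing">0</long>
        <double name="sumOfSquares">3276.9851000000003</double>
        <double name="mean">32.62333333333333</double>
        <double name="stddev">6.486118510583508</double>
      </lst>
    </lst>
  </lst>
</response>"""


def parse_stats(xml_text, field):
    """Return the statistics for one field as a dict, or {} if the field is absent."""
    root = ET.fromstring(xml_text)
    path = f"./lst[@name='stats']/lst[@name='stats_fields']/lst[@name='{field}']"
    node = root.find(path)
    if node is None:
        return {}
    result = {}
    for child in node:
        # <long> elements hold whole numbers; the others here are doubles.
        value = int(child.text) if child.tag == "long" else float(child.text)
        result[child.get("name")] = value
    return result


price_stats = parse_stats(RESPONSE_XML, "price")
print(price_stats["min"], price_stats["max"])
```

In the e-commerce scenario from the introduction, the min and max entries are the values you would display as the price range for the query.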


As you can see, in addition to the standard results list, the response contains an additional stats section. Now let’s see how it works.

How it works…

The index structure is pretty straightforward. It contains three fields: one for holding the unique identifier (the id field), one for holding the name (the name field), and one for holding the price (the price field).

The file that contains the example data is simple too, so I’ll skip discussing it.

The query is interesting. In addition to the q parameter, we have two new parameters. The first one, stats=true, tells Solr that we want to use the StatsComponent, the component which will calculate the statistics for us. The second parameter, stats.field=price, tells the StatsComponent which field to use for the calculation. In our case, we told Solr to use the price field.

Now let’s look at the result returned by Solr. As you can see, the StatsComponent added an additional section to the results. This section contains the statistics generated for the field we told Solr we want statistics for. The following statistics are available:

  • min: The minimum value that was found in the field for the documents that matched the query
  • max: The maximum value that was found in the field for the documents that matched the query
  • sum: Sum of all values in the field for the documents that matched the query
  • count: How many non-null values were found in the field for the documents that matched the query
  • missing: How many documents that matched the query didn’t have any value in the specified field
  • sumOfSquares: The sum of the squares of all values in the field for the documents that matched the query
  • mean: The average for the values in the field for the documents that matched the query
  • stddev: The standard deviation for the values in the field for the documents that matched the query
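To make these definitions concrete, the sketch below recomputes the same statistics in plain Python from the three example prices. It is an illustration of the definitions, not Solr's actual implementation, and the describe function is a name made up for this example. One observation from the example data: the stddev value in the sample response agrees with the sample standard deviation (dividing by n - 1), which is what statistics.stdev computes.

```python
import statistics

# Prices of the three example documents.
prices = [39.99, 30.11, 27.77]


def describe(values):
    """Compute the statistics listed above for a non-empty list of numbers."""
    count = len(values)
    total = sum(values)
    return {
        "min": min(values),
        "max": max(values),
        "sum": total,
        "count": count,
        "sumOfSquares": sum(v * v for v in values),
        "mean": total / count,
        # Sample standard deviation (n - 1 in the denominator);
        # statistics.stdev needs at least two values.
        "stddev": statistics.stdev(values),
    }


stats = describe(prices)
for name, value in stats.items():
    print(f"{name}: {value}")
```

The last digits may differ slightly from Solr's output because of floating-point rounding, so compare the values with a tolerance rather than for exact equality.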

You should also remember that you can specify multiple stats.field parameters to calculate statistics for different fields in a single query.
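As a rough sketch of what such a request looks like, the Python snippet below builds a URL with a repeated stats.field parameter. Note the assumption: the weight field does not exist in this recipe's schema and is used purely for illustration, so you would replace it with a numeric field from your own index.

```python
from urllib.parse import urlencode

# Base URL as used in the recipe; adjust it for your own Solr instance.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# "weight" is a hypothetical second field, not part of the recipe's schema.
params = {
    "q": "name:book",
    "stats": "true",
    "stats.field": ["price", "weight"],
}

# doseq=True expands the list into one stats.field parameter per entry.
multi_stats_url = SOLR_SELECT_URL + "?" + urlencode(params, doseq=True)
print(multi_stats_url)
```

Each listed field then gets its own entry under stats_fields in the response.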

Please be careful when using this component on multi-valued fields, as it can become a performance bottleneck.
