Using Faceted Search, from Searching to Finding



Looking at Solr’s standard query parameters

The basic engine of Solr is Lucene, so Solr accepts a query syntax based on Lucene’s. There are some minor differences, but they involve more advanced behavior and should not affect our experiments. You can find an explanation of the Solr query syntax on the wiki at: http://wiki.apache.org/solr/SolrQuerySyntax.

Let’s see some examples of queries using the basic parameters. Before starting our tests, we need to configure a new core again, in the usual way.

Sending Solr’s query parameters over HTTP

It is important to remember that our queries to Solr are sent over HTTP (unless we are using Solr in embedded mode, as we will see later). With cURL we can let the tool handle the HTTP encoding of parameters, for example:

>> curl -X POST 'http://localhost:8983/solr/paintings/select?start=3&rows=2&fq=painting&wt=json&indent=true' \
     --data-urlencode 'q=leonardo da vinci' \
     --data-urlencode 'fl=artist title'

This command can be used in place of the following one:

>> curl -X GET "http://localhost:8983/solr/paintings/select?q=leonardo%20da%20vinci&fq=painting&start=3&rows=2&fl=artist%20title&wt=json&indent=true"

Note how the --data-urlencode option in the first example lets us write parameter values containing characters that need to be encoded over HTTP, such as spaces.
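To see what this encoding amounts to, here is a minimal sketch of a percent-encoding helper in POSIX shell. The function name urlencode is hypothetical (it is not part of cURL or Solr), and it assumes plain ASCII input. In practice cURL does this work for us.

```shell
# Hypothetical helper (not part of cURL or Solr): percent-encode a string
# so that it can be used as a URL parameter value. Assumes ASCII input.
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    s=$rest
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;              # unreserved: keep as-is
      *) out=$out$(printf '%%%02X' "'$c") ;;      # anything else: %XX
    esac
  done
  printf '%s\n' "$out"
}

urlencode 'leonardo da vinci'   # prints leonardo%20da%20vinci
urlencode 'artist title'        # prints artist%20title
```

The two printed values are the same encoded forms that appear in the GET version of the command.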

Testing HTTP parameters on browsers

On modern browsers such as Firefox or Chrome you can inspect the parameters directly in the built-in console. For example, in Chrome you can open the console with F12:

In the previous image you can see, under the Query String Parameters section on the right, that the parameters are shown in a list, and we can easily switch between the encoded and the more readable unencoded version of each value.

If you don’t like using Chrome or Firefox and want a similar tool, you can try Firebug Lite (http://getfirebug.com/firebuglite). This is a JavaScript library designed to bring the functionality of the Firebug plugin to virtually every browser: you add the library to your HTML page during the test process.

Choosing a format for the output

When sending a query to Solr directly (from the browser or with cURL) we can ask for results in multiple formats, including, for example, JSON:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=*:*&wt=json&indent=true'

Time for action – searching all documents with pagination

When performing a query we need to remember we are potentially asking for a huge number of documents. Let’s observe how to manage partial results using pagination:

  1. For example, think about the q=*:* query seen in the previous examples, which asks for all the documents without specific criteria. In a case like this, to avoid problems with resources, Solr will actually send us only the first results, as defined by a parameter in the configuration. The default number of returned results is 10, so we need to be able to ask for a second group of results, then a third, and so on until there are no more. This is generally called pagination of results, as in similar scenarios involving SQL.
  2. Execute the command:

    >> curl -X GET "http://localhost:8983/solr/paintings/select?q=*:*&start=0&rows=0&wt=json&indent=true"

  3. We should obtain a result similar to this (the number of documents found, numFound, and the time spent processing the query, QTime, could vary depending on your data and your system):

In the previous image we see the same results in two different ways: on the right side you’ll recognize the output from cURL, and on the left side you see how the results appear directly in the browser window.

In the second example we had the JSONView plugin installed in the browser, which gives a very helpful visualization of JSON, with indentation and colors. If you want, you can install it for Chrome from:

https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc

For Firefox the plugin can be installed from:

https://addons.mozilla.org/it/firefox/addon/jsonview/

Note how even if we have found 12484 documents, we are currently seeing none of them in the results!

What just happened?

In this very simple example we already use two very useful parameters: start and rows. We should always think of them as a pair, even if we are using only one of them explicitly. We could change the default values for these parameters in the solrconfig.xml file, but this is generally not needed:

  • The start parameter defines the zero-based index of the first document returned in the response, among the ones matching our search criteria. The default value is 0.
  • The rows parameter defines how many documents we want in the results. The default value is 10.

So if for example we only want the second and third document from the results, we can obtain them by the query:

>> curl -X GET "http://localhost:8983/solr/paintings/select?q=*:*&start=1&rows=2&wt=json&indent=true"

To obtain the second document in the results we need to remember that the enumeration starts from 0 (so the second document is at index 1). To see the next group of documents (if present), we send a new query with values such as start=10 and rows=10, and so on. We are still using the wt and indent parameters only to have the results formatted in a clear way.

In this context the start/rows parameters play roles quite similar to the OFFSET/LIMIT clauses in SQL.

This process of segmenting the output so that it can be read in groups, or pages, of results is usually called pagination, and it is generally handled by some programming code. You should know this mechanism, so that you can run your tests even on a small segment of data without loss of generality. I strongly suggest that you always add these two parameters explicitly in your examples.
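As a minimal sketch of such code, the following shell function computes the start value for a given page and prints the corresponding URL. The function name page_url and the choice of 1-based page numbers are assumptions made for illustration only.

```shell
# Hypothetical helper: build the select URL for a 1-based page number.
# start is the zero-based offset of the first document on that page.
page_url() {
  base=$1
  page=$2
  rows=$3
  start=$(( (page - 1) * rows ))
  printf '%s?q=*:*&start=%d&rows=%d&wt=json&indent=true\n' "$base" "$start" "$rows"
}

# Print the URLs for the first three pages of 10 documents each
# (start=0, start=10, start=20).
for p in 1 2 3; do
  page_url 'http://localhost:8983/solr/paintings/select' "$p" 10
done
```

Each printed URL can then be passed to cURL inside double quotes, as in the previous commands.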

Time for action – projecting fields with fl

Another important parameter to consider is fl, which can be used for field projection, that is, for obtaining only certain fields in the results:

  1. Suppose now that we are interested in obtaining the title and artist fields for all the documents:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=title,artist'

  2. We will obtain an output similar to the one shown in the following image:

  3. Note that the results are indented as requested and, to be more readable, do not contain any header. Moreover, the list of parameters does not need to be written in a specific order.
  4. The previous query could also be rewritten as:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=title&fl=artist'

Here we ask for the projection of each field one by one, which can be useful, for example, when an HTML and JavaScript widget composes the query following the user’s choices.

What just happened?

The fl parameter stands for fields list. With this parameter we can define a comma-separated list of field names that explicitly states which fields are projected in the results. We can also use a space to separate the fields, but in this case we should URL-encode the space, writing fl=title+artist or fl=title%20artist.

If you are familiar with relational databases and SQL, you can think of the fl parameter as similar to the SELECT clause in SQL statements, which is used to project the selected fields in the results. In the same way, writing fl=author:artist,title corresponds to the usage of aliases, as in SELECT artist AS author, title.
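When the list of fields comes from user choices, the fl value can be composed programmatically. A minimal sketch, assuming a hypothetical helper named fl_param that joins its arguments with commas:

```shell
# Hypothetical helper: join field names into a comma-separated fl parameter.
fl_param() {
  fl=
  for f in "$@"; do
    fl=${fl:+$fl,}$f     # add a comma only between fields
  done
  printf 'fl=%s\n' "$fl"
}

fl_param title artist            # prints fl=title,artist
fl_param author:artist title     # prints fl=author:artist,title
```

The second call shows that an alias such as author:artist is passed through as an ordinary list element.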

Let’s see the full list of parameters in detail:

  • The parameter q=artist:* is used in this case in place of the more generic q=*:*, to select only the documents which have a value for the artist field. The special character * is used again to indicate all the values.
  • The wt=json and indent=true parameters are used to ask for an indented JSON format.
  • The omitHeader=true parameter is used to omit the header from the response.
  • The fl=title,artist parameter represents the list of the fields to be projected in the results.

Note how the fields are projected in the results without following the order requested in fl, as the order has no particular meaning for JSON output. The order will be used, however, by the CSV response writer that we will see later, where changing the order of the columns could be mandatory.

In addition to the existing fields, which can all be requested by using the * special character, we could also ask for the projection of the implicit score field. A combination of these two options can be seen in the following query:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q=artist:*&wt=json&indent=true&omitHeader=true&fl=*,score'

This will return every field for every document, explicitly including the score field, which is sometimes called a pseudo-field to distinguish it from the fields defined by a schema.

Time for action – selecting documents with filter query

Sometimes it’s useful to be able to narrow the collection of documents on which we are currently performing our search. Adding this kind of explicit condition is useful on the logical side, for navigating the data, and it also has a good impact on performance.

This is shown in the following example:

It shows how the default search is restricted by the introduction of an fq=annunciation condition.
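A request of this kind could be sketched as follows. The helper name filtered_query_cmd is hypothetical, and it only prints the cURL command instead of running it. The -G option asks cURL to append the URL-encoded values to the URL as a query string, and the paintings core is the one assumed in the earlier examples.

```shell
# Hypothetical helper: print (without running) a cURL command that combines
# a main query q with a filter query fq on the paintings core.
filtered_query_cmd() {
  q=$1
  fq=$2
  printf "curl -G 'http://localhost:8983/solr/paintings/select'"
  printf " --data-urlencode 'q=%s'" "$q"
  printf " --data-urlencode 'fq=%s'" "$fq"
  printf " --data-urlencode 'wt=json' --data-urlencode 'indent=true'\n"
}

# All documents (q=*:*), restricted to those matching fq=annunciation.
filtered_query_cmd '*:*' 'annunciation'
```

Keeping q and fq as separate arguments mirrors the way Solr treats them: the main query expresses what we are searching for, while the filter narrows the set of documents being searched.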

What just happened?

The first result in this simple example shows that we obtain results similar to what we could have obtained by a simple q=annunciation search. Filter queries can be cached (as can facets, which we will see later). This improves performance by reducing the overhead of performing the same query many times, and of accessing the same group of documents in a large dataset many times.

In this case the analogy with SQL seems less convincing, but q=dali and fq=abstract:painting can both be seen as corresponding to WHERE conditions in SQL. The fq parameter then acts as a fixed condition.

In our scenario we could, for example, define specific endpoints with a predefined filter query by author, to create specific channels. In this case, instead of passing the parameters every time, we could set them in solrconfig.xml.

Time for action – searching for similar terms with Fuzzy search

Even though wildcard queries are very flexible, sometimes they simply cannot give us good results. There could be some weird typo in the term, and we still want to obtain good results wherever possible, under certain confidence conditions:

  1. Suppose I want to write painting and I actually search for plainthing, for example:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=abstract:plainthing~0.5&wt=json'

  2. Suppose we have a person using a different language, who searched for leonardo by misspelling the name:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q=abstract:lionardo~0.5&wt=json'

In both cases the examples use misspelled words to make the point more recognizable, but the same syntax can be used to intercept existing similar words.

What just happened?

Both the preceding examples work as expected. The first gives us documents containing the term painting, while the second gives us documents containing leonardo. Note that the syntax plainthing~0.5 represents a query that matches with a certain confidence, so, for example, we will also obtain documents with the term paintings, which is good, but in a more general case we could receive weird results. To set the confidence value properly there are not many options, apart from doing tests.

Using fuzzy search is a simple way to obtain suggested results for alternate forms of a search query, just as when we trust a search engine’s suggestions in the "did you mean" approach.
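Since tuning the confidence value comes down to testing, it can help to generate the same fuzzy query with several values and compare the results. A minimal sketch, assuming a hypothetical helper named fuzzy_queries (the values 0.5, 0.7 and 0.9 are arbitrary choices for illustration):

```shell
# Hypothetical helper: print one fuzzy q parameter per confidence value,
# so that the same term can be tested with several thresholds.
fuzzy_queries() {
  field=$1
  term=$2
  shift 2
  for sim in "$@"; do
    printf 'q=%s:%s~%s\n' "$field" "$term" "$sim"
  done
}

fuzzy_queries abstract plainthing 0.5 0.7 0.9
```

Each printed line can be appended to the select URL of the paintings core, as in the two commands shown earlier in this section.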
