Faceting in Solr 1.4 Enterprise Search Server

Faceting, after searching, is arguably the second-most valuable feature in Solr. It is perhaps even the most fun you’ll have, because you will learn more about your data than with any other feature. Faceting enhances search results with aggregated information over all of the documents found in the search. Given a search on MusicBrainz releases, it can answer questions such as:

  • How many are official, bootleg, or promotional?
  • What were the top five most common countries in which the releases occurred?
  • Over the past ten years, how many were released in each year?
  • How many have names in these ranges: A-C, D-F, G-I, and so on?
  • Given a track search, how many are < 2 minutes long, 2-3, 3-4, or more?

In addition, faceting can power term-suggest (also known as auto-complete) functionality, which enables your search application to suggest a completed word as the user types, based on the most commonly occurring words that start with what they have already typed. So if a user started typing siamese dr, then Solr might suggest that dreams is the most likely word, along with other alternatives.

Faceting, sometimes referred to as faceted navigation, is usually used to power user interfaces that display this summary information with clickable links that apply Solr filter queries to a subsequent search.

If we revisit the comparison of search technology to databases, then faceting is more or less analogous to SQL’s group by feature on a column with count(*). However, in Solr, facet processing is performed after an existing search as part of a single request-response, with both the primary search results and the faceting results coming back together. In SQL, you would potentially need to perform a series of separate queries to get the same information.
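To make the group-by analogy concrete, here is a minimal sketch in Python that tallies one field over a set of matched documents. The documents and the facet_counts helper are hypothetical stand-ins for illustration only; this is not how Solr computes facets internally.

```python
from collections import Counter

# Hypothetical in-memory "index": each dict stands in for a release document.
docs = [
    {"id": "Release:1", "r_official": "Official"},
    {"id": "Release:2", "r_official": "Bootleg"},
    {"id": "Release:3", "r_official": "Official"},
    {"id": "Release:4"},  # no r_official value at all
]


def facet_counts(matched_docs, field):
    """Count how many matched documents carry each value of `field`."""
    return Counter(doc[field] for doc in matched_docs if field in doc)


# The "search" here matches every document, much like q=*:* would.
counts = facet_counts(docs, "r_official")
print(counts.most_common())
```

The tally is taken over everything the search matched, which is the same idea as grouping by a column and counting rows.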

A quick example: Faceting release types

Observe the following search results. echoParams is set to explicit (defined in solrconfig.xml) so that the search parameters are seen here. This example is using the standard handler (though perhaps dismax is more typical). The query parameter q is *:*, which matches all documents. In this case, the index I’m using only has releases. If there were non-releases in the index, then I would add a filter fq=type%3ARelease to the URL or put this in the handler configuration, because releases are the data set we’ll be using for most of this article. I wanted to keep this example brief, so I set rows to 2. Sometimes when using faceting, you only want the facet information and not the main search results, in which case you would set rows to 0.

It’s important to understand that the faceting numbers are computed over the entire search result, which is all of the releases in this example, and not just the two rows being returned.
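As a rough sketch, a request carrying these parameters could be assembled as shown here. The base URL is an assumption (a local Solr instance on its usual example port); the parameter names and values are a subset of those echoed in the response.

```python
from urllib.parse import urlencode

# Assumed base URL for a local Solr instance; adjust to your deployment.
BASE_URL = "http://localhost:8983/solr/select"

params = [
    ("q", "*:*"),                               # match all documents
    ("rows", "2"),                              # keep the main result short
    ("fl", "*,score"),
    ("indent", "on"),
    ("facet", "true"),                          # turn faceting on
    ("facet.field", "r_official"),              # the field to facet on
    ("f.r_official.facet.missing", "true"),     # field-specific syntax
    ("f.r_official.facet.method", "enum"),
]

url = BASE_URL + "?" + urlencode(params)
print(url)
```

Setting rows to 0 in the list above would return only the facet information.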

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">160</int>
<lst name="params">
<str name="wt">standard</str>
<str name="rows">2</str>
<str name="facet">true</str>
<str name="q">*:*</str>
<str name="fl">*,score</str>
<str name="qt">standard</str>
<str name="facet.field">r_official</str>
<str name="f.r_official.facet.missing">true</str>
<str name="f.r_official.facet.method">enum</str>
<str name="indent">on</str>
</lst>
</lst>
<result name="response" numFound="603090" start="0" maxScore="1.0">
<doc>
<float name="score">1.0</float>
<str name="id">Release:136192</str>
<str name="r_a_id">3143</str>
<str name="r_a_name">Janis Joplin</str>
<arr name="r_attributes"><int>0</int><int>9</int>
<int>100</int></arr>
<str name="r_name">Texas International Pop Festival 11-30-69</str>
<int name="r_tracks">7</int>

<str name="type">Release</str>
</doc>
<doc>
<float name="score">1.0</float>
<str name="id">Release:133202</str>
<str name="r_a_id">6774</str>
<str name="r_a_name">The Dubliners</str>
<arr name="r_attributes"><int>0</int></arr>
<str name="r_lang">English</str>
<str name="r_name">40 Jahre</str>
<int name="r_tracks">20</int>
<str name="type">Release</str>
</doc>
</result>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="r_official">
<int name="Official">519168</int>
<int name="Bootleg">19559</int>
<int name="Promotion">16562</int>
<int name="Pseudo-Release">2819</int>
<int>44982</int>
</lst>
</lst>
<lst name="facet_dates"/>
</lst>
</response>

The facet-related search parameters appear in the params list at the top of the response. The facet.missing and facet.method parameters were set using the field-specific syntax (the f.r_official. prefix), which will be explained shortly.

Notice that the facet results follow the main search result and are given the name facet_counts. In this example, we only faceted on one field, r_official, but you’ll learn in a bit that you can facet on as many fields as you desire. The name attribute holds a facet value, which is simply an indexed term, and the integer following it is the number of documents in the search results containing that term, aka a facet count. The final entry, which has no name attribute, is the facet.missing count: the number of matching documents with no value in the field. The next section explains where r_official and r_type came from.
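For a client consuming this response, a minimal sketch of reading the facet counts with Python's standard library might look like this. The read_field_facets helper is a hypothetical name; the XML is a trimmed copy of the facet_counts section above, and the unnamed final int is treated as the facet.missing count.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the facet_counts section from the response above.
RESPONSE = """<response>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="r_official">
<int name="Official">519168</int>
<int name="Bootleg">19559</int>
<int name="Promotion">16562</int>
<int name="Pseudo-Release">2819</int>
<int>44982</int>
</lst>
</lst>
<lst name="facet_dates"/>
</lst>
</response>"""


def read_field_facets(xml_text, field):
    """Return (value, count) pairs; value is None for the unnamed missing entry."""
    root = ET.fromstring(xml_text)
    path = (
        "./lst[@name='facet_counts']/lst[@name='facet_fields']"
        f"/lst[@name='{field}']/int"
    )
    return [(node.get("name"), int(node.text)) for node in root.findall(path)]


facets = read_field_facets(RESPONSE, "r_official")
for value, count in facets:
    print(value or "(missing)", count)
```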

MusicBrainz schema changes

In order to get more self-explanatory faceting results out of the r_attributes field and to split its dual meaning, I modified the schema and added some text analysis. r_attributes is an array of numeric constants, which signify various types of releases and their official-ness, for lack of a better word. As it represents two different things, I created two new fields, r_type and r_official, with copyField directives to copy r_attributes into them:

<field name="r_attributes" type="integer" multiValued="true" 
indexed="false" /><!-- ex: 0, 1, 100 -->
<field name="r_type" type="rType" multiValued="true"
stored="false" /><!-- Album | Single | EP |... etc. -->
<field name="r_official" type="rOfficial" multiValued="true"
stored="false" /><!-- Official | Bootleg | Promotional -->

And:

<copyField source="r_attributes" dest="r_type" />
<copyField source="r_attributes" dest="r_official" />

In order to map the constants to human-readable definitions, I created two field types, rType and rOfficial, that use a regular expression to remove the unwanted numbers and a synonym list to map from the constant to the human-readable definition. Conveniently, the constants for r_type are in the range 1-11, whereas those for r_official are 100-103. I removed the constant 0, as it seemed to be bogus.

<fieldType name="rType" class="solr.TextField" sortMissingLast="true" 
omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory"
pattern="^(0|1\d\d)$" replacement="" replace="first" />
<filter class="solr.LengthFilterFactory" min="1" max="100" />
<filter class="solr.SynonymFilterFactory" synonyms="mb_attributes.txt"
ignoreCase="false" expand="false"/>
</analyzer>
</fieldType>

The definition of the type rOfficial is the same as rType, except it has this regular expression: ^(0|\d\d?)$.

LengthFilterFactory is present to ensure that no zero-length (empty-string) terms get indexed. Without it, they would be, because the preceding regular expression reduces text matching the unwanted patterns to empty strings.

The content of mb_attributes.txt is as follows:

# from: http://bugs.musicbrainz.org/browser/mb_server/trunk/
# cgi-bin/MusicBrainz/Server/Release.pm#L48
#note: non-album track seems bogus; almost everything has it
0=>Non-Album\ Track
1=>Album
2=>Single
3=>EP
4=>Compilation
5=>Soundtrack
6=>Spokenword
7=>Interview
8=>Audiobook
9=>Live
10=>Remix
11=>Other
100=>Official
101=>Promotion
102=>Bootleg
103=>Pseudo-Release

When implementing faceted navigation, it does not matter whether the user interface uses the name (for example: Official) or the constant (for example: 100) in its filter queries, as the text analysis will let the names through and will map the constants to the names. This is not necessarily true in the general case, but it is for the text analysis as I’ve configured it above.
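The behavior of the two field types can be approximated in a few lines of Python. This is only a simulation of the analysis chain, not Solr code: it assumes the patterns are ^(0|1\d\d)$ for rType and ^(0|\d\d?)$ for rOfficial, uses a subset of mb_attributes.txt, and the analyze and index_terms helpers are hypothetical names.

```python
import re

# Subset of mb_attributes.txt, keyed by the numeric constant.
SYNONYMS = {
    "1": "Album", "2": "Single", "9": "Live",
    "100": "Official", "101": "Promotion", "102": "Bootleg",
}

# Each type blanks out the constants that belong to the other meaning.
R_TYPE_DROP = re.compile(r"^(0|1\d\d)$")
R_OFFICIAL_DROP = re.compile(r"^(0|\d\d?)$")


def analyze(value, drop_pattern):
    """Mimic: keyword token -> pattern replace -> length filter -> synonym."""
    token = drop_pattern.sub("", value, count=1)
    if not 1 <= len(token) <= 100:  # stands in for LengthFilterFactory
        return None
    return SYNONYMS.get(token, token)  # names pass through unchanged


def index_terms(values, drop_pattern):
    """Analyze each value and keep only the terms that survive."""
    terms = (analyze(v, drop_pattern) for v in values)
    return [t for t in terms if t is not None]


attributes = ["0", "9", "100"]  # r_attributes of the first document above
type_terms = index_terms(attributes, R_TYPE_DROP)
official_terms = index_terms(attributes, R_OFFICIAL_DROP)
print(type_terms, official_terms)
```

Passing either 100 or Official through analyze with the rOfficial pattern yields the same term, which is why the user interface can use either form in a filter query.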

The approach I took was relatively simple, but it is not the only way to do it. Alternatively, I might have split the attributes and/or mapped them as part of the import process. This would allow me to remove the multiValued setting in r_official. Moreover, it wasn’t truly necessary to map the numbers to their names, as the user interface that presents the data could very well map them on the fly.

Field requirements

The principal requirement of a field that will be faceted on is that it must be indexed. In addition, for all but the prefix faceting use case, you will also want to use text analysis that does not tokenize the text. For example, the value Non-Album Track is indexed as a single term in r_type. I needed to be careful to escape the space where this appeared in mb_attributes.txt. Otherwise, faceting on this field would show tallies for Non-Album and Track separately. Depending on the type of faceting you want to do and other needs you have, like sorting, you will often find it necessary to have a copy of a field just for faceting. Remember that with faceting, the facet values returned in search results are the actual terms indexed, and not the stored value, which isn’t even used.
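The effect of tokenization on facet tallies can be illustrated with a small Python sketch. The sample values are made up, and splitting on whitespace is only a stand-in for a tokenizing analyzer.

```python
from collections import Counter

# Made-up field values, one per document.
values = ["Non-Album Track", "Album", "Non-Album Track"]

# Untokenized: each whole value is a single indexed term.
whole_terms = Counter(values)

# Whitespace-tokenized: every word becomes its own indexed term.
split_terms = Counter(word for value in values for word in value.split())

print(whole_terms)
print(split_terms)
```

With the tokenized variant, Non-Album and Track are tallied separately, which is rarely what a faceted navigation display should show.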
