Apache Solr: Analyzing your Text Data

Apache Solr 3.1 Cookbook

Over 100 recipes to discover new ways to work with Apache's Enterprise Search Server

Introduction

A type's behavior can be defined in the context of the indexing process, the context of the query process, or both. Furthermore, a type definition is composed of a tokenizer and filters (both token filters and character filters). The analyzer specifies how your data will be processed after it is sent to the appropriate field; it operates on the whole data that is sent to the field. A type can have only one tokenizer, and the result of the tokenizer's work is a stream of objects called tokens. Next in the analysis chain are the filters. They operate on the tokens in the token stream, and they can do anything with them: change them, remove them, or, for example, make them lowercase. A type can have multiple filters.
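To make the chain concrete, here is a rough Python sketch (illustrative only, not Solr code) of what an analyzer built from a whitespace tokenizer and a lowercase filter conceptually does to a field value:

```python
def whitespace_tokenize(text):
    # Tokenizer: turn the raw field value into a stream of tokens.
    return text.split()

def lowercase_filter(tokens):
    # Token filter: transform each token in the stream.
    return [token.lower() for token in tokens]

def analyze(text, tokenizer, filters):
    # An analyzer: exactly one tokenizer, followed by any number of filters,
    # each consuming the token stream produced by the previous step.
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("Big Book EXAMPLE", whitespace_tokenize, [lowercase_filter]))
# ['big', 'book', 'example']
```

The key point the sketch captures is the ordering: one tokenizer produces the stream, then filters run over it in sequence.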

One additional kind of filter is the character filter. Character filters do not operate on tokens from the token stream; they operate on the raw data sent to the field, and they are invoked before the data reaches the tokenizer.

This article will focus on the data analysis and how to handle the common day-to-day analysis questions and problems.

Storing additional information using payloads

Imagine that you have a powerful preprocessing tool that can extract information about all the words in a text. Your boss would like you to use it with Solr, or at least store the information it returns in Solr. So what can you do? You can use something called a payload to store that data. This recipe will show you how to do it.

How to do it…

Let's assume that we already have an application that takes care of recognizing the parts of speech in our text data. Now we need to add that information to the Solr index. To do that, we will use payloads: metadata that can be stored with each occurrence of a term.

First of all, you need to modify the index structure. For this, we will add the new field type to the schema.xml file:

<fieldType name="partofspeech" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="integer" delimiter="|"/>
  </analyzer>
</fieldType>

Now add the field definition part to the schema.xml file:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="text" type="text" indexed="true" stored="true"/>
<field name="speech" type="partofspeech" indexed="true" stored="true" multiValued="true"/>

Now let’s look at what the example data looks like (I named it ch3_payload.xml):

<add>
<doc>
<field name="id">1</field>
<field name="text">ugly human</field>
<field name="speech">ugly|3 human|6</field>
</doc>
<doc>
<field name="id">2</field>
<field name="text">big book example</field>
<field name="speech">big|3 book|6 example|1</field>
</doc>
</add>

Let’s index our data. To do that, we run the following command from the exampledocs directory (put the ch3_payload.xml file there):

java -jar post.jar ch3_payload.xml

How it works…

What information can a payload hold? It may hold any information that is compatible with the encoder type you define for the solr.DelimitedPayloadTokenFilterFactory filter. In our case, we don't need to write our own encoder; we will use the supplied one to store integers, which will hold the boost of the term. For example, nouns will be given a token boost value of 6, while adjectives will be given a boost value of 3.

First we have the type definition. We defined a new type in the schema.xml file, named partofspeech, based on the Solr text field (attribute class="solr.TextField"). Our tokenizer splits the given text on whitespace characters. Then we have a new filter which handles our payloads. The filter defines an encoder, which in our case is an integer (attribute encoder="integer"), and a delimiter which separates the term from the payload. In our case, the separator is the pipe character |.
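As an illustration only (this is not how Solr implements it internally), the following Python sketch mimics what the whitespace tokenizer plus the delimited payload filter conceptually produce from a speech field value:

```python
def parse_payloads(value, delimiter="|"):
    # The whitespace tokenizer first splits the field value into tokens;
    # the payload filter then splits each token on the delimiter, keeping
    # the left part as the term and decoding the right part as an integer
    # payload (mirroring encoder="integer").
    pairs = []
    for token in value.split():
        term, _, payload = token.partition(delimiter)
        pairs.append((term, int(payload)))
    return pairs

print(parse_payloads("big|3 book|6 example|1"))
# [('big', 3), ('book', 6), ('example', 1)]
```

Each term in the index ends up with its integer payload attached, rather than the payload being indexed as part of the term text.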

Next we have the field definitions. In our example, we only define three fields:

  • The identifier
  • The text
  • The recognized part of speech with payloads

 

Now let's take a look at the example data. We have two simple fields: id and text. The one that we are interested in is the speech field. Look at how it is defined: it contains pairs made of a term, a delimiter, and a boost value, for example, book|6. In the example, I decided to boost nouns with a value of 6 and adjectives with a value of 3. I also decided that words that cannot be identified by my part-of-speech application will be given a boost of 1. The pairs are separated with a space character, which is exactly what the whitespace tokenizer we defined earlier splits on.

To index the documents, we use the simple post tool provided with the example deployment of Solr. To use it, we invoke the command shown in the example. The post tool will send the data to the default update handler found at the address http://localhost:8983/solr/update. The parameter following the command is the file that is going to be sent to Solr. You can also post a list of files, not just a single one.

That is how you index payloads in Solr. As of Solr 1.4.1, there is no further out-of-the-box support for payloads. Hopefully this will change, but for now you need to write your own query parser and similarity class (or extend the ones present in Solr) to make use of them.

Eliminating XML and HTML tags from the text

There are many real-life situations when you have to clean your data. Let’s assume that you want to index web pages that your client sends you. You don’t know anything about the structure of that page—one thing you know is that you must provide a search mechanism that will enable searching through the content of the pages. Of course, you could index the whole page by splitting it by whitespaces, but then you would probably hear the clients complain about the HTML tags being searchable and so on. So, before we enable searching on the contents of the page, we need to clean the data. In this example, we need to remove the HTML tags. This recipe will show you how to do it with Solr.

How to do it…

Let’s suppose our data looks like this (the ch3_html.xml file):

<add>
<doc>
<field name="id">1</field>
<field name="html"><![CDATA[<html><head><title>My page</title></head><body><p>This is a <b>my</b><i>sample</i> page</body></html>]]></field>
</doc>
</add>

Now let’s take care of the schema.xml file. First add the type definition to the schema.xml file:

<fieldType name="html_strip" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

And now, add the following to the field definition part of the schema.xml file:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="html" type="html_strip" indexed="true" stored="false"/>

Let’s index our data. To do that, we run the following command from the exampledocs directory (put the ch3_html.xml file there):

java -jar post.jar ch3_html.xml

If there were no errors, you should see a response like this:

SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded
in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file ch3_html.xml
SimplePostTool: COMMITting Solr index changes..

How it works…

First of all, we have the example data. In it, we see one document with two fields: the identifier and some HTML data nested in a CDATA section. You must remember to surround the HTML data with CDATA if it is a full page that starts with tags such as <html>, as in our example; otherwise Solr will have problems parsing the data. However, if you only have a few tags present in the data, you shouldn't worry.

Next, we have the html_strip type definition. It is based on solr.TextField to enable full-text searching. Following that, we have a character filter which handles the stripping of HTML and XML tags. This is something new in Solr 1.4. Character filters are invoked before the data is sent to the tokenizer, so they operate on untokenized data. In our case, the character filter strips the HTML and XML tags, attributes, and so on, and then sends the data to the tokenizer, which splits it by whitespace characters. The one and only filter defined in our type makes the tokens lowercase to simplify searching.
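As a rough approximation of the whole chain (Solr's HTMLStripCharFilterFactory is far more robust than this sketch), here is a Python simulation using the standard library's HTML parser in place of the character filter:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    # Character-filter stand-in: collect only the text between tags.
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def analyze_html(raw):
    # 1. Char filter: strip HTML/XML tags from the raw field value.
    stripper = TagStripper()
    stripper.feed(raw)
    text = " ".join(stripper.chunks)
    # 2. Tokenizer: split on whitespace.  3. Filter: lowercase each token.
    return [token.lower() for token in text.split()]

print(analyze_html("<p>This is a <b>my</b> <i>Sample</i> page</p>"))
# ['this', 'is', 'a', 'my', 'sample', 'page']
```

Note the ordering the sketch preserves: the tags are removed before tokenization, so no tag fragment can ever become a searchable token.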

To index the documents, we use the simple post tool provided with the example deployment of Solr. To use it, we invoke the command shown in the example. The post tool will send the data to the default update handler found at the address http://localhost:8983/solr/update. The parameter of the command execution is the file that is going to be sent to Solr. You can also post a list of files, not just a single one.

As you can see, the sample response from the post tools is rather informative. It provides information about the update handler address, files that were sent, and information about commits being performed.

If you want to check how your data was indexed, remember that stored field contents (attribute stored="true") hold the original value sent to Solr, so you won't be able to see the filters in action there. If you wish to check the actual data structures, please take a look at the Luke utility (a tool that lets you see the index structure and field values, and operate on the index). Luke can be found at the following address: http://code.google.com/p/luke

Solr also provides a tool that lets you see how your data is analyzed. That tool is part of the Solr administration pages.

Copying the contents of one field to another

Imagine that you have many big XML files that hold information about the books stored on library shelves. There is not much data: just the unique identifier, the name of the book, and the name of the author. One day your boss comes to you and says: "Hey, we want to facet and sort on the basis of the book author". You could change your XML and add two fields, but why do that when Solr can do it for you? Well, Solr won't modify your data, but it can copy the data from one field to another. This recipe will show you how to do that.

How to do it…

Let’s assume that our data looks like this:

<add>
<doc>
<field name="id">1</field>
<field name="name">Solr Cookbook</field>
<field name="author">John Kowalsky</field>
</doc>
<doc>
<field name="id">2</field>
<field name="name">Some other book</field>
<field name="author">Jane Kowalsky</field>
</doc>
</add>

We want the contents of the author field to be present in the fields named author, author_facet, and author_sort. So let's define the copy fields in the schema.xml file (place the following right after the fields section):

<copyField source="author" dest="author_facet"/>
<copyField source="author" dest="author_sort"/>

And that’s all. Solr will take care of the rest.

The field definition part of the schema.xml file could look like this:

<field name="id" type="string" indexed="true" stored="true"
required="true"/>
<field name="author" type="text" indexed="true" stored="true"
multiValued="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="author_facet" type="string" indexed="true"
stored="false"/>
<field name="author_sort" type="alphaOnlySort" indexed="true"
stored="false"/>

Let's index our data. To do that, we run the following command from the exampledocs directory (put the data.xml file there):

java -jar post.jar data.xml

How it works…

As you can see in the example, we have only three fields defined in our sample data XML file. There are two fields which we are not particularly interested in: id and name. The field that interests us the most is the author field. As I have mentioned earlier, we want to place the contents of that field in three fields:

  • author (the actual field that will hold the data)
  • author_sort
  • author_facet

 

To do that, we use copy fields. Those instructions are defined in the schema.xml file, right after the field definitions, that is, after the </fields> tag. To define a copy field, we need to specify a source field (attribute source) and a destination field (attribute dest).

With definitions like those in the example, Solr will copy the contents of the source fields to the destination fields during the indexing process. There is one thing you have to be aware of: the content is copied before the analysis process takes place. This means the raw data is copied exactly as it was sent to the source field, and each destination field then applies its own analysis to that raw value.
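That copy-before-analysis behavior can be sketched in Python (a hypothetical simulation; the analyzers and field names below are illustrative, not Solr internals):

```python
def index_document(doc, copy_fields, analyzers):
    # Step 1: copyField duplicates raw values, before any analysis runs.
    for source, dest in copy_fields:
        if source in doc:
            doc.setdefault(dest, doc[source])
    # Step 2: each field is then processed by its own analysis chain;
    # fields with no analyzer (string-like fields) keep the raw value.
    return {name: analyzers.get(name, lambda v: v)(value)
            for name, value in doc.items()}

analyzers = {
    "author": lambda v: v.lower().split(),  # text-like field: tokenized
    "author_sort": lambda v: v.lower(),     # sort-friendly single token
    # author_facet: string field, raw value kept as-is for faceting
}

indexed = index_document(
    {"id": "1", "author": "John Kowalsky"},
    copy_fields=[("author", "author_facet"), ("author", "author_sort")],
    analyzers=analyzers,
)
print(indexed["author_facet"])  # 'John Kowalsky' (raw copy, untouched)
print(indexed["author_sort"])   # 'john kowalsky'
```

Because the copy happens on the raw value, the facet field sees the original "John Kowalsky" even though the source author field is tokenized and lowercased.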

There’s more…

There are a few things worth noting when talking about copying the contents of one field to another.

Copying the contents of dynamic fields to one field

You can also copy multiple field content to one field. To do that, you should define a copy field like this:

<copyField source="*_author" dest="authors"/>

A definition like the one above copies the contents of all fields ending with _author into a single field named authors. Remember that if you copy multiple fields to one field, the destination field should be defined as multiValued.

Limiting the number of characters copied

There may be situations where you only need to copy a defined number of characters from one field to another. To do that we add the maxChars attribute to the copy field definition. It can look like this:

<copyField source="author" dest="author_facet" maxChars="200"/>

The above definition tells Solr to copy up to 200 characters from the author field to the author_facet field. This attribute can be very useful when copying the contents of multiple fields to one field.
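Conceptually, maxChars is just a character-level truncation of the raw source value before it reaches the destination field; a minimal sketch of that idea (again, a simulation, not Solr code):

```python
def copy_field(value, max_chars=None):
    # copyField with maxChars: take at most max_chars characters of the
    # raw source value; with no limit, the whole value is copied.
    return value if max_chars is None else value[:max_chars]

print(copy_field("John Kowalsky", max_chars=4))
# 'John'
```

Note that the cut is purely by character count, so it can fall in the middle of a word.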
