A Quick Overview of Lucene
Included with OpenCms is a distribution of the Lucene search engine. Lucene is an open source, high-performance text search engine that is both easy to use and full-featured. Lucene is not a product. It is a Java library providing data indexing, and search and retrieval support. OpenCms integrates with Lucene to provide these features for its VFS content.
Though Lucene is simple to use, it is highly flexible and has many options. We will not go into the full details of all the options here, but will provide a basic overview, which will help us in developing our search code. A full understanding of Lucene is not required for completing this article, but interested readers can find more information at the Lucene website: http://jakarta.apache.org/lucene. There are also several excellent books available, which can easily be found with a web search.
For any data to be searched, it must first be indexed. Lucene supports both disk and memory based indexes, but OpenCms uses the more suitable disk based indexes. There are three basic concepts to understand regarding Lucene search indexes: Documents, Analyzers, and Fields.
- Document: A document is a collection of Lucene fields. A search index is made up of documents. Although each document is built from some actual source content, there is no need for the document to exactly resemble it. The fields stored in the document are indexed and stored and used to locate the document.
- Analyzer: An analyzer is responsible for breaking down source content into words (or terms) for indexing. An analyzer may take a very simple approach of only parsing content at whitespace breaks or a more complex approach by removing common words, identifying email and web addresses, and understanding abbreviations or other languages. Though Lucene provides many optional analyzers, the default one used by OpenCms is usually the best choice. For more advanced search applications, the other analyzers should be looked at in more depth.
- Field: A field consists of data that can be stored, indexed, or queried. Field values are searched when a query is made to the index. There are two characteristics of a field that determine how it gets treated when indexed:
- Field Storage: The storage characteristic of a field determines whether or not the field data value gets stored into the index. It is not necessary to store field data if the value is unimportant and is used only to help locate a document. On the other hand, field data should be stored if the value needs to be returned with the search result.
- Field Indexing: This characteristic determines whether a field will get indexed, and if so, how. There is no need to index fields that will not be used as search terms, and the value should not be indexed. This is useful if we need to return a field value but will never search for the document using that field in a search term. However, for fields that are searchable, the field may be indexed in either a tokenized or an un-tokenized fashion. If a field is tokenized, then it will first be run through an analyzer. Each term generated by the analyzer will be indexed for the field. If it is un-tokenized, then the field’s value is indexed, verbatim. In this case, the term must be searched for using an exact match of its value, including the case.
Lucene also provides the ability to define a boost value for a field. This affects the relevance of the field when it is used in a search. A value other than the default value of 1.0 may be used to raise or lower the relevance.
These are the important concepts to be understood while creating a Lucene search index. After an index has been created, documents may be searched through queries.
Querying Lucene search indexes is supported through a Java API and a search querying language. Search queries are made up of terms and operators. A term can be a simple word such as “hello” or a phrase such as “hello world”. Operators are used to form logical expressions with terms, such as AND or NOT. With the Java API, terms can be built and aggregated together along with operators to form a query. When using the query language, a Java class is provided to parse the query and convert it into a format suitable for passing to the engine. In addition to these search features, there are more advanced operations that may be performed such as fuzzy searches, range searches, and proximity searches.
All these options and flexibility allow Lucene to be used in an application in many ways. OpenCms does a good job of using these options to provide search capabilities for a wide range of content types. Next, we will look at how OpenCms interfaces with Lucene to provide this support.
Configuring OpenCms Search
OpenCms maintains search settings in the opencms-search.xml configuration file located in the WEB-INF/config directory. Prior to OpenCms 7, most of the settings in this configuration file needed to be made by hand. With OpenCms 7, the Search Management tool in the Administration View has been improved to cover most of the settings. We will first go over the settings that are controlled through the Search Management view, and will then visit the settings that must still be changed by hand. The first thing we’ll do is define our own search index for the blog content. Creating a new search index is simple with the Administration tool. We access it by clicking on the Search Management icon of the Administrative View, and then clicking on the New Index icon:
The Name field contains the name of the index file. This name can also be passed to a Java API. If the content differs between the online and offline areas, we can create an index for each one. For now, we will start with the offline index. We’ll name it: Blogs – Offline. The other fields are:
- Rebuild mode: This determines if the index is to be built manually or automatically as content changes. We want automatic updating and will hence choose auto.
- Locale: We must select a locale for the content. OpenCms will extract the content for the given locale when it builds our index. If we were supporting more than one locale, then it would be a good idea to include the locale in the index name.
- Project: This selects content from either the Online or Offline project.
- Field configuration: This selects a field configuration to be used for the index.
We do not have our own field configuration yet; so for now press OK to save the index. Next, we will define a field configuration for the blog content.