Getting Started with Apache Solr

We’re going to get started by downloading Solr, examining its directory structure, and then finally running it. This sets you up for the next section, which tours a running Solr server.

Get Solr: You can download Solr from its website: http://lucene.apache.org/solr/. The last Solr release this article was written for is version 3.4. Solr has had several relatively minor point releases since 3.1, and more will follow. In general I recommend using the latest release, since Solr and Lucene’s code is extensively tested. Lucid Imagination also provides a Solr distribution called “LucidWorks for Solr”. As of this writing it is Solr 3.2 with some choice patches that came afterwards to ensure its stability and performance. It’s completely open source; previous LucidWorks releases were not, as they included some extras with use limitations. LucidWorks for Solr is a good choice if maximum stability matters more to you than newer features.

Get Java: The only prerequisite software needed to run Solr is Java 5 (a.k.a. Java version 1.5) or later, ideally Java 6. Typing java -version at a command line will tell you exactly which version of Java you are using, if any.
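
If you would rather check the Java version from a script than by eye, the following is a minimal sketch. It assumes the first line of the java -version output has the usual form java version "1.6.0_29". The java_major function name is my own invention, not something that ships with Java or Solr.

```shell
# Sketch: extract the major Java version from a `java -version` line.
# java_major is a hypothetical helper name, not part of Java or Solr.
java_major() {
  # $1 is a version line such as: java version "1.6.0_29"
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  case "$ver" in
    1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;  # old scheme: 1.6.0_29 means Java 6
    *)   printf '%s\n' "$ver" | cut -d. -f1 ;;  # otherwise take the first number
  esac
}
```

Since java -version prints to standard error, a call would look like java_major "$(java -version 2>&1 | head -n 1)". A result of 5 or higher satisfies the requirement above.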

Use the latest version of Java! The initial release of Java 7 included some serious bugs, discovered shortly before its release, that affect Lucene and Solr. The release of Java 7u1 on October 19th, 2011 resolves these issues. The same bugs occurred with Java 6 under certain JVM switches, and Java 6u29 resolves them. Therefore, I advise you to use the latest Java release.

Java is available on all major platforms including Windows, Solaris, Linux, and Mac OS X. Visit http://www.java.com to download the distribution for your platform. Java always comes with the Java Runtime Environment (JRE), and that’s all Solr requires. The Java Development Kit (JDK) includes the JRE plus the Java compiler and various diagnostic utility programs. One such useful program is jconsole, so I recommend the JDK distribution.

Solr is a Java-based web application, but you don’t need to be particularly familiar with Java in order to use it.

Solr’s installation directory structure

When you unzip Solr after downloading it, you should find a relatively straightforward directory structure:

  • client: Convenient language-specific client APIs for talking to Solr.

    Ignore the client directory: most client libraries are maintained by other organizations, except for the Java client SolrJ, which lies in the dist/ directory. client/ only contains solr-ruby, which has fallen out of favor compared to rsolr (both are Ruby Solr clients).

  • contrib: Solr contrib modules. These are extensions to Solr. The final JAR file for each of these contrib modules is actually in dist/; so the actual files here are mainly the dependent JAR files.
    • analysis-extras: A few text analysis components that have large dependencies. There are some “ICU” Unicode classes for multilingual support, a Chinese stemmer, and a Polish stemmer.
    • clustering: An engine for clustering search results.
    • dataimporthandler: The DataImportHandler (DIH), a very popular contrib module that imports data into Solr from a database and some other sources.
    • extraction: Integration with Apache Tika, a framework for extracting text from common file formats. This module is also called SolrCell, and Tika is also used by the DIH’s TikaEntityProcessor.
    • uima: Integration with Apache UIMA—a framework for extracting metadata out of text. There are modules that identify proper names in text and identify the language, for example. To learn more, see Solr’s wiki: http://wiki.apache.org/solr/SolrUIMA.
    • velocity: Simple Search UI framework based on the Velocity templating language.
  • dist: Solr’s WAR and contrib JAR files. The Solr WAR file is the main artifact that embodies Solr as a standalone file deployable to a Java web server. The WAR does not include any contrib JARs. You’ll also find the core of Solr as a JAR file, which you might use if you are embedding Solr within an application, and Solr’s test framework as a JAR file, which is to assist in testing Solr extensions. You’ll also see SolrJ’s dependent JAR files here.
  • docs: Documentation—the HTML files and related assets for the public Solr website, to be precise. It includes a good quick tutorial, and of course Solr’s API. Even if you don’t plan on extending the API, some parts of it are useful as a reference to certain pluggable Solr configuration elements—see the listing for the Java package org.apache.solr.analysis in particular.
  • example: A complete Solr server, serving as an example. It includes the Jetty servlet engine (a Java web server), Solr, some sample data and sample Solr configurations. The interesting child directories are:
    • example/etc: Jetty’s configuration. Among other things, here you can change the web port used from the pre-supplied 8983 to 80 (HTTP default).
    • example/exampledocs: Sample documents to be indexed into the default Solr configuration, along with the post.jar program for sending the documents to Solr.
    • example/solr: The default, sample Solr configuration. This should serve as a good starting point for new Solr applications. It is used in Solr’s tutorial.
    • example/webapps: Where Jetty expects to deploy Solr from. A copy of Solr’s WAR file is here, which contains Solr’s compiled code.
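
To see where the web port is set before editing anything, you can search Jetty’s configuration for the default port number. This is only a sketch: find_jetty_port is a hypothetical helper name, and it assumes the port appears literally as 8983 in a file under example/etc.

```shell
# Sketch: list the lines in Jetty's configuration that mention port 8983.
# find_jetty_port is a hypothetical helper; $1 is the path to the example directory.
find_jetty_port() {
  grep -rn '8983' "$1/etc"
}
```

Run from the unzipped Solr directory as find_jetty_port example. Each match is printed as file:line:text, which tells you which file and line to edit.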

Solr’s home directory and Solr cores

When Solr starts, the very first thing it does is determine where the Solr home directory is. There are various ways to tell Solr where it is, but by default it’s the directory named simply solr, relative to the current working directory from which Solr is started. You will usually see a solr.xml file in the home directory, which is optional but recommended. It mainly lists Solr cores. For simpler configurations like example/solr, there is just one Solr core, which uses Solr’s home directory as its core instance directory. A Solr core holds one Lucene index and the supporting Solr configuration for that index. Nearly all interactions with Solr are targeted at a specific core. If you want to index different types of data separately, or shard a large index into multiple ones, then Solr can host multiple Solr cores on the same Java server.
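
One of the ways to tell Solr where its home directory is, rather than relying on the working directory, is the solr.solr.home Java system property. The sketch below wraps that in a small function for the bundled Jetty, which is launched through start.jar as covered under Running Solr. The start_solr name, the DRY_RUN variable, and the paths in the usage note are all assumptions of mine.

```shell
# Sketch: start the bundled Jetty with an explicit Solr home directory.
# start_solr and DRY_RUN are hypothetical names used for illustration only.
#   $1: directory that contains start.jar (the example directory or a copy of it)
#   $2: Solr home directory (an absolute path avoids working-directory surprises)
start_solr() {
  example_dir=$1
  solr_home=$2
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only show what would be run.
    echo "cd $example_dir && java -Dsolr.solr.home=$solr_home -jar start.jar"
  else
    # Run in a subshell so the caller's working directory is left alone.
    (cd "$example_dir" && java -Dsolr.solr.home="$solr_home" -jar start.jar)
  fi
}
```

For example, start_solr /opt/mysolr /opt/mysolr/solr would start Jetty from a copy of the example directory; set DRY_RUN=1 first to print the command without running it.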

A Solr core’s instance directory is laid out like this:

  • conf: Configuration files. The two I mention below are very important, but it will also contain some other .txt and .xml files which are referenced by these two.
  • conf/schema.xml: The schema for the index including field type definitions with associated analyzer chains.
  • conf/solrconfig.xml: The primary Solr configuration file.
  • conf/xslt: Various XSLT files that can be used to transform Solr’s XML query responses into formats such as Atom and RSS.
  • conf/velocity: HTML templates and related web assets for rapid UI prototyping using Solritas. The soon to be discussed “browse” UI is implemented with these templates.
  • data: Where Lucene’s index data lives. It’s binary data, so you won’t be doing anything with it except perhaps deleting it occasionally to start anew.
  • lib: Where extra Java JAR files can be placed that Solr will load on startup. This is a good place to put contrib JAR files, and their dependencies.
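
Before pointing Solr at a directory, it can be handy to confirm that the two important configuration files are actually there. Here is a minimal sketch; check_core_dir is a hypothetical helper name, and it deliberately checks only conf/schema.xml and conf/solrconfig.xml, treating data and lib as optional.

```shell
# Sketch: verify that a directory has the two key files of a core instance directory.
# check_core_dir is a hypothetical helper; $1 is the core instance directory.
check_core_dir() {
  dir=$1
  missing=0
  for f in conf/schema.xml conf/solrconfig.xml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  # Returns 0 when both files exist, 1 otherwise.
  return $missing
}
```

Against the sample configuration, check_core_dir example/solr should print nothing and succeed.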

Running Solr

Now we’re going to start up Jetty and finally see Solr running, albeit without any data to query yet.

We’re about to run Solr directly from the unzipped installation. This is great for exploring Solr and doing local development, but it’s not what you would seriously do in a production scenario. In a production scenario you would have a script or other mechanism to start and stop the servlet engine with the operating system; Solr does not include this. And to keep your system organized, you should keep the example directory as exactly what its name implies: an example. So if you want to use the provided Jetty servlet engine in production (a fine choice), then copy the example directory elsewhere and name it something else.

First go to the example directory, and then run Jetty’s start.jar file by typing the following command:

>> cd example
>> java -jar start.jar

The >> notation is the command prompt. These commands will work across *nix and DOS shells. You’ll see about a page of output, including references to Solr. When it is finished, you should see this output at the very end of the command prompt:

2008-08-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983

The 0.0.0.0 means it’s listening for connections from any host (not just localhost, notwithstanding potential firewalls) and 8983 is the port. Seeing this line from Jetty doesn’t necessarily mean that Solr was deployed successfully. You might see an error such as a stack trace in the output if something went wrong. Even if it did go wrong, you should be able to access the web server: http://localhost:8983. Jetty will give you a 404 page but it will include a list of links to deployed web applications, which will just be Solr for this setup. Solr is accessible at: http://localhost:8983/solr, and if you browse to that page, then you should either see details about an error if Solr wasn’t loaded correctly, or a simple page with a link to Solr’s admin page, which should be http://localhost:8983/solr/admin/. You’ll be visiting that link often.
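
The same checks can be made from a second command prompt instead of a browser. The sketch below assumes curl is installed (it does not come with Solr), and http_status is a hypothetical helper name of my own.

```shell
# Sketch: print the HTTP status code for a URL, or 000 if nothing answers.
# http_status is a hypothetical helper and assumes curl is installed.
http_status() {
  # -s: quiet, -o /dev/null: discard the body, -w: print only the status code.
  curl -s --max-time 5 -o /dev/null -w '%{http_code}' "$1" || true
}
```

With Jetty running, http_status http://localhost:8983/ should print the 404 described above, and a 200 from http_status http://localhost:8983/solr/admin/ suggests the admin page is being served. A result of 000 means nothing answered on that port at all.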

To quit Jetty (and many other command line programs for that matter), press Ctrl+C on the keyboard.
