





















































Let's get started.
There are a few dimensions to the options available for communicating with Solr:
Applications interact with Solr over HTTP. This can be done either directly, using an HTTP client of your choice, or through a Solr integration API such as SolrJ or Solr Flare, which in turn uses HTTP.
An exception to HTTP is offered by SolrJ, which can optionally be used in an embedded fashion with Solr (so-called Embedded Solr) to avoid network and inter-process communication altogether. However, unless you are sure you really want to embed Solr within another application, this option is discouraged in favor of writing a custom Solr updating request handler.
Even though an application will be communicating with Solr over HTTP, it does not have to send Solr data over this channel. Solr supports what it calls remote streaming. Instead of being given the data directly, Solr is given a URL that it will resolve. It might be an HTTP URL, but more likely it is a filesystem-based URL, applicable when the data is already on Solr's machine. Finally, in the case of Solr's DataImportHandler, the data can be fetched from a database.
The following are the different data formats:
We'll use the XML, CSV, and DIH options in bringing the MusicBrainz data into Solr from its database to demonstrate Solr's capability. Most likely, an application would use just one format.
Before these approaches are described, we'll discuss curl and remote streaming, which are foundational topics.
Solr receives commands (and possibly the associated data) through HTTP POST.
Solr lets you use HTTP GET too (for example, through your web browser). However, this is an inappropriate HTTP verb if it causes something to change on the server, as happens with indexing. For more information on this concept, read about REST at:
http://en.wikipedia.org/wiki/Representational_State_Transfer.
One way to send an HTTP POST is through the Unix command line program curl (also available on Windows through Cygwin). Even if you don't use curl, it is very important to know how we're going to use it, because the concepts will be applied no matter how you make the HTTP messages.
There are several ways to tell Solr to index data, and all of them are through HTTP POST:
Here is an example of the first choice, sending the data as the entire body of the POST. Let's say we have an XML file named artists.xml in the current directory. We can post it to Solr using the following command line:
curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' --data-binary @artists.xml
If it succeeds, then you'll have output that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int><int name="QTime">128</int>
</lst>
</response>
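The same exchange can be sketched without curl. The following is a minimal Python illustration, not a definitive client: it builds the POST with the XML as the entire body and reads the status value out of a response shaped like the one above. The function names are hypothetical, and the URL is simply the one used in the curl example.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint, copied from the curl example; adjust for your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_xml_post(xml_bytes, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST whose entire body is the XML payload."""
    return urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def response_status(response_xml):
    """Return the integer status found in the responseHeader of a Solr XML response."""
    root = ET.fromstring(response_xml)
    status = root.find("./lst[@name='responseHeader']/int[@name='status']")
    if status is None:
        raise ValueError("no responseHeader status in response")
    return int(status.text)
```

To actually send the request, you would read artists.xml as bytes, pass them to build_xml_post, hand the result to urllib.request.urlopen, and give the bytes you read back to response_status. A status of 0 corresponds to the successful output shown above.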
To use the stream.body feature for the example above, you would do this:
curl http://localhost:8983/solr/update -F stream.body=@artists.xml
In both cases, the @ character instructs curl to read the data from the named file instead of sending the text @artists.xml literally. If the XML is short, then you can just as easily specify it literally on the command line:
curl http://localhost:8983/solr/update -F stream.body=' <commit />'
Notice the leading space in the value. This was intentional. With -F, curl gives a special meaning to a value that starts with @ or <, which is not what we want here. In this case, it might be more appropriate to use --form-string instead of -F, because it sends the value literally. However, it's more typing, and I'm feeling lazy.
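When the request is built in code rather than with curl, the @ and < quirks do not arise, so no leading space is needed. Here is a minimal Python sketch under stated assumptions: the function name is hypothetical, and the form is sent URL-encoded rather than as the multipart form that curl -F produces, on the assumption that Solr reads the stream.body parameter from either kind of form.

```python
import urllib.parse
import urllib.request

# Assumed endpoint, copied from the curl examples; adjust for your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_stream_body_post(xml_text, url=SOLR_UPDATE_URL):
    """Build (but do not send) a form POST carrying the XML in stream.body."""
    # urlencode escapes characters such as '<', '/' and spaces for us.
    form = urllib.parse.urlencode({"stream.body": xml_text}).encode("ascii")
    return urllib.request.Request(
        url,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

For example, build_stream_body_post("<commit />") yields a request that could be passed to urllib.request.urlopen to issue the commit.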
In the examples above, we've given Solr the data to index in the HTTP message. Alternatively, the POST request can give Solr a pointer to the data in the form of either a file path accessible to Solr or an HTTP URL to it.
The file path is accessed by the Solr server on its own machine, not by the client, and the Solr process must have the necessary operating system file permissions to read it.
As before, the originating request does not return a response until Solr has finished processing it. Remote streaming is practical when you're sending a large CSV file. More generally, if the file is of a decent size or is already at some known URL, then you may find remote streaming faster and/or more convenient, depending on your situation.
Here is an example of Solr accessing a local file:
curl http://localhost:8983/solr/update -F stream.file=/tmp/artists.xml
Note that we're passing a name-value parameter (stream.file and the path), not the actual data. To use a URL instead, the parameter would change to stream.url, and its value would be the URL.
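The choice between the two parameters can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, the rule of treating anything with an http or https scheme as a URL is an assumption made for the example, and the form is sent URL-encoded rather than as the multipart form that curl -F produces.

```python
import urllib.parse
import urllib.request

# Assumed endpoint, copied from the curl examples; adjust for your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_remote_stream_post(location, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST that points Solr at the data.

    Only the parameter name and the location are sent, never the data itself.
    """
    # Assumption for this sketch: http(s) locations are URLs; anything else
    # is a file path that the Solr server resolves on its own machine.
    scheme = urllib.parse.urlsplit(location).scheme
    param = "stream.url" if scheme in ("http", "https") else "stream.file"
    form = urllib.parse.urlencode({param: location}).encode("ascii")
    return urllib.request.Request(
        url,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Calling build_remote_stream_post("/tmp/artists.xml") mirrors the curl command above. As with that command, the request would only succeed against a Solr instance that has remote streaming enabled.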
Remote streaming must be enabled
In order to use remote streaming (stream.file or stream.url), you must enable it in solrconfig.xml. It is disabled by default and is configured on a line that looks like this:
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />
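If you want to confirm the setting before relying on stream.file or stream.url, a small check along these lines may help. It is a sketch with a hypothetical function name, and it assumes you have the text of solrconfig.xml at hand; it simply looks for the first requestParsers element and reads its enableRemoteStreaming attribute.

```python
import xml.etree.ElementTree as ET


def remote_streaming_enabled(solrconfig_xml):
    """Return True if the first requestParsers element enables remote streaming."""
    root = ET.fromstring(solrconfig_xml)
    # iter() walks the whole tree, including the root element itself.
    parsers = next(root.iter("requestParsers"), None)
    if parsers is None:
        # No requestParsers element found; report it as not enabled.
        return False
    value = parsers.get("enableRemoteStreaming", "false")
    return value.strip().lower() == "true"
```

A typical use would be to read the file into a string and call remote_streaming_enabled on it before sending a remote streaming request.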