(For more resources on Solr, see here.)

Direct database and XML import

The capability for Solr to get data directly from a database or HTTP GET accessible XML is distributed with Solr as a contrib module, and it is known as the DataImportHandler (DIH in short). The complete reference documentation for this capability is here at http://wiki.apache.org/solr/DataImportHandler, and it's rather thorough. In this article, we'll only walk through an example to see how it can be used with the MusicBrainz data set.

In short, the DIH offers the following capabilities:

Imports data from databases through JDBC (Java Database Connectivity)

Imports XML data from a URL (HTTP GET) or a file

Can combine data from different tables or sources in various ways

Extraction/Transformation of the data

Import of updated (delta) data from a database, assuming a last-updated date

A diagnostic/development web page

Extensible to support alternative data sources and transformation steps

As the MusicBrainz data is in a database, the most direct method to get data into Solr is definitely through the DIH using JDBC.

Getting started with DIH

DIH is not a direct part of Solr. Hence it might not be included in your Solr setup. It amounts to a JAR file named something like apache-solr-dataimporthandler-1.4.jar, which is probably already embedded within the solr.war file. You can use jar -tf solr.war to see. Alternatively, it may be placed in <solr-home>/lib, which is alongside the conf directory we've been working with. For database connectivity, we need to ensure that the JDBC driver is on the Java classpath. Placing it in <solr-home>/lib is a convenient way to do this.

The DIH needs to be registered with Solr in solrconfig.xml. Here is how it is done:

<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">mb-dih-artists-jdbc.xml</str>
</lst>
</requestHandler>

mb-dih-artists-jdbc.xml (mb being short for MusicBrainz) is a file in <solr-home>/conf, which is used to configure DIH. It is possible to specify some configuration aspects in this request handler configuration instead of the dedicated configuration file. However, I recommend that it all be in the DIHconfig file, as in our example here.

Given below is an mb-dih-artists-jdbc.xml file with a rather long SQL query:

<dataConfig>
 <dataSource name="jdbc" driver="org.postgresql.Driver"
 url="jdbc:postgresql://localhost/musicbrainz_db"
 user="musicbrainz" readOnly="true" autoCommit="false" />
 <document>
 <entity name="artist" dataSource="jdbc" pk="id" query="
 select
 a.id as id,
 a.name as a_name, a.sortname as a_name_sort,
 a.begindate as a_begin_date, a.enddate as a_end_date,
 a.type as a_type
 ,array_to_string(
 array(select aa.name from artistalias aa
 where aa.ref = a.id )
 , '|') as a_alias
 ,array_to_string(
 array(select am.name from v_artist_members am
 where am.band = a.id order by am.id)
 , '|') as a_member_name
 ,array_to_string(
 array(select am.id from v_artist_members am
 where am.band = a.id order by am.id)
 , '|') as a_member_id,
 (select re.releasedate from release re inner join
 album r on re.album = r.id where r.artist = a.id
 order by releasedate desc limit 1) as
 a_release_date_latest
 from artist a
 "
 transformer="RegexTransformer,DateFormatTransformer,
 TemplateTransformer">
 <field column = "id" template="Artist:${artist.id}" />
 <field column = "type" template="Artist" />
 <field column = "a_begin_date"
 dateTimeFormat="yyyy-MM-dd" />
 <field column = "a_end_date"
 dateTimeFormat="yyyy-MM-dd" />
 <field column = "a_alias" splitBy="|" />
 <field column = "a_member_name" splitBy="|"/>
 <field column = "a_member_id" splitBy="|" />
 </entity>
 </document>
</dataConfig>

The DIH development console

Before describing the configuration details, we're going to take a look at the DIH development console. It is accessed by going to this URL (modifications may be needed for your host, port, core, and so on):
http://localhost:8983/solr/admin/dataimport.jsp

The development console looks like the following screenshot:

indexing-data-solr-14-enterprise-search-server-part2-img-0

The screen is divided into two panes: on the left is the DIH control form, which includes an editable version of the DIH configuration file and on the right is the command output as raw XML. The screen works quite simply. The form essentially results in submitting a URL to the right pane. There's no real server-side logic to this interface beyond the standard DIH command invocations being executed on the right. The last section on DIH in this article goes into more detail on submitting a command to the DIH.

DIH DataSources of type JdbcDataSource

The DIH configuration file starts with the declaration of one or more data sources using the element <dataSource/>, which refers to either a database, a file, or an HTTP URL, depending on the type attribute. It defaults to a value of JdbcDataSource. Those familiar with JDBC should find the driver and url attributes with accompanying user and password straightforward—consult the documentation for your driver/database for further information. readOnly is a boolean that will set a variety of other JDBC options appropriately when set to true. And batchSize is an alias for the JDBC fetchSize and defaults to 500. There are numerous JDBC oriented attributes that can be set as well. I would not normally recommend learning about a feature by reading source code, but this is an exception. For further information, read org.apache.solr.handler.dataimport.JdbcDataSource.java

Efficient JDBC configuration

Many database drivers in the default configurations (including those for PostgreSQL and MySQL) fetch all of the query results into the memory instead of on-demand or using a batch/fetch size. This may work well for typical database usage like OLTP (Online Transaction Processing systems), but is completely unworkable for ETL (Extract Transform and Load) usage such as this. Configuring the driver to stream the data requires driver-specific configuration options. You may need to consult relevant documentation for the JDBC driver. For PostgreSQL, set autoCommit to false. For MySQL, set batchSize to -1(The DIH detects the -1 and replaces it with Java's Integer.MIN_VALUE, which triggers the special behavior in MySQL's JDBC driver). For Microsoft SQL Server, set responseBuffering to adaptive. Further information about specific databases is at :
http://wiki.apache.org/solr/DataImportHandlerFaq..

DIH documents, entities

After the declaration of <dataSource/> element(s) is the <document/> element. In turn, this element contains one or more <entity/> elements. In this sample configuration, we're only getting artists. However, if we wanted to have more than one type in the same index, then another could be added. The dataSource attribute references a correspondingly named element earlier. It is only necessary if there are multiple to choose from, but we've put it here explicitly anyway.

The main piece of an entity used with a JDBC data source is the query attribute, which is the SQL query to be evaluated. You'll notice that this query involves some sub-queries, which are made into arrays and then transformed into strings joined by spaces. The particular functions used to do these sorts of things are generally database specific. This is done to shoe-horn multi-valued data into a single row in the results. It may create a more complicated query, but it does mean that the database does all of the heavy lifting so that all of the data Solr needs for an artist is in the row. An alternative with DIH is to declare other entities within the entity. If you aren't using a database or if you wish to mix in another data source (even if it's of a different type), then you will be forced to do that. See the Solr DIH Wiki page for examples: http://wiki.apache.org/solr/DataImportHandler.

The DIH also supports a delta query, which is a query that selects time-stamped data with dates after the last queried date. This won't be covered here, but you can find more information at the previous URL.

DIH fields and transformers

Within the <entity/> are some <field/>elements that declare how the columns in the query map to Solr. The field element must have a column attribute that matches the corresponding named column in the SQL query. The name attribute is the Solr schema field name that the column is going into. If it is not specified (and it never is for our example), then it defaults to the column name.

Use the SQL as a keyword as we've done to use the same names as the Solr schema instead of the database schema. This reduces the number of explicit mappings needed in <field/> elements and shortens existing ones.

When a column in the result can be placed directly into Solr without further processing, there is no need to specify the field declaration, because it is implied.

An attribute of the entity declaration that we didn't mention yet is transformer. This declares a comma-separated list of transformers that manipulate the transfer of data from the JDBC resultset into a Solr field. These transformers evaluate a field, if it has an attribute it uses to do its job. More than one might operate on a given field. Therefore, the order in which the transformers are declared in matters. Here are the attributes we've used:

template: It is used by TemplateTransformer and declares text, which might include variable name substitutions using ${name} syntax. To access an existing field, use the entityname.columnname syntax.

splitBy: It is used by RegexTransformer and splits a single string value into a multi-value by looking for the specified character.

dateTimeFormat: It is used by DateFormatTransformer. This is a Java date/time format pattern http://java.sun.com/j2se/1.5.0/docs/api/java/text/SimpleDateFormat.html). If the type of the field in the schema is a date, then it is necessary to ensure Solr can interpret the format. Alternatively, ensure that the string matches the ISO-8601 format, which looks like this: 1976-10-23T23:59:59.000Z. As in all cases in Solr, when specifying dates you can use its so-called "DateMath" syntax such as appending /DAY to tell Solr to round the date to a day.