The classic plugin for Rails is acts_as_solr, which allows Rails ActiveRecord objects to be transparently stored in a Solr index. Other popular options include Solr Flare and rsolr. An interesting project is Blacklight, a tool oriented towards libraries putting their catalogs online. While it attempts to meet the needs of a specific market, it also contains many examples of great Ruby techniques to leverage in your own projects.

You will need to turn on the Ruby writer type in solrconfig.xml:

<queryResponseWriter name="ruby"
class="org.apache.solr.request.RubyResponseWriter"/>

The Ruby hash structure has some tweaks to fit Ruby: nulls are translated to nils, string content is quoted and escaped with single quotes, and the Ruby => operator separates key-value pairs in maps. Adding a wt=ruby parameter to a standard search request returns results in a Ruby hash structure like this:

{
'responseHeader'=>{
'status'=>0,
'QTime'=>1,
'params'=>{
'wt'=>'ruby',
'indent'=>'on',
'rows'=>'1',
'start'=>'0',
'q'=>'Pete Moutso'}},
'response'=>{'numFound'=>523,'start'=>0,'docs'=>[
{
'a_name'=>'Pete Moutso',
'a_type'=>'1',
'id'=>'Artist:371203',
'type'=>'Artist'}]
}}
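
Because the response is a valid Ruby literal, a client can turn it into a Hash with eval. The following is a minimal sketch, assuming the response body is already in a string and comes from a Solr server you trust, since eval executes whatever it is given. RAW_RESPONSE is a trimmed copy of the output above, and parse_ruby_response is a hypothetical helper name, not part of any library:

```ruby
# Trimmed copy of the wt=ruby output shown above. In a real client this
# string would be the body of the HTTP response.
RAW_RESPONSE = <<~RUBY
  {
    'responseHeader'=>{'status'=>0, 'QTime'=>1},
    'response'=>{'numFound'=>523, 'start'=>0, 'docs'=>[
      {'a_name'=>'Pete Moutso', 'id'=>'Artist:371203', 'type'=>'Artist'}]}
  }
RUBY

# Hypothetical helper: evaluates the response text into a Ruby Hash.
# eval runs arbitrary code, so only use it on output from a trusted server.
def parse_ruby_response(body)
  eval(body)
end

result = parse_ruby_response(RAW_RESPONSE)
first  = result['response']['docs'].first
puts "#{result['response']['numFound']} matches, first: #{first['a_name']}"
```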

acts_as_solr

A very common naming pattern for Rails plugins that manipulate the database-backed object model is to name them acts_as_X. For example, the very popular acts_as_list plugin for Rails allows you to add list semantics, such as first, last, and move_next, to an unordered collection of items. In the same manner, acts_as_solr takes ActiveRecord model objects and transparently indexes them in Solr. This allows you to do fuzzy queries that are backed by Solr searches, but still work with your normal ActiveRecord objects. Let’s go ahead and build a small Rails application that we’ll call MyFaves that both allows you to store your favorite MusicBrainz artists in a relational model and allows you to search for them using Solr.

acts_as_solr comes bundled with a full copy of Solr 1.3 as part of the plugin, which you can easily start by running rake solr:start. Typically, you are starting with a relational database already stuffed with content that you want to make searchable. However, in our case we already have a fully populated index available in /examples, and we are actually going to take the basic artist information out of the mbartists index of Solr and populate our local myfaves database with it. We’ll then fire up the version of Solr shipped with acts_as_solr, and see how acts_as_solr manages the lifecycle of ActiveRecord objects to keep Solr’s indexed content in sync with the content stored in the relational database. Don’t worry, we’ll take it step by step! The completed application is in /examples/8/myfaves for you to refer to.

Setting up MyFaves project

We’ll start with the standard plumbing to get a Rails application set up with our basic data model:

>>rails myfaves
>>cd myfaves
>>./script/generate scaffold artist name:string group_type:string release_date:datetime image_url:string
>>rake db:migrate

This generates a basic application backed by an SQLite database. Now we need to install the acts_as_solr plugin.

acts_as_solr has gone through a number of revisions, from the original code base written by Erik Hatcher and posted to the solr-user mailing list in August of 2006, which was then extended by Thiago Jackiw and hosted on RubyForge. Today the best version of acts_as_solr is hosted on GitHub by Mathias Meyer at http://github.com/mattmatt/acts_as_solr/tree/master. The constant migration from one site to another, leading to multiple possible ‘best’ versions of a plugin, is unfortunately a very common problem with Rails plugins and projects, though most are settling on either RubyForge.org or GitHub.com.

In order to install the plugin, run:

>>script/plugin install git://github.com/mattmatt/acts_as_solr.git

We’ll also be working with roughly 399,000 artists, so obviously we’ll need some pagination to manage that list, otherwise pulling up the artists /index listing page will time out:

>>script/plugin install git://github.com/mislav/will_paginate.git

Edit the ./app/controllers/artists_controller.rb file, and replace in the index method the call to @artists = Artist.find(:all) with:

@artists = Artist.paginate :page => params[:page], :order =>
'created_at DESC'

Also add to ./app/views/artists/index.html.erb a call to the view helper to generate the page links:

<%= will_paginate @artists %>

Start the application using ./script/server, and visit the page http://localhost:3000/artists/. You should see an empty listing page for all of the artists. Now that we know the basics are working, let’s go ahead and actually leverage Solr.

Populating MyFaves relational database from Solr

Step one will be to import data into our relational database from the mbartists Solr index. Add the following code to ./app/models/artist.rb:

class Artist < ActiveRecord::Base
  acts_as_solr :fields => [:name, :group_type, :release_date]
end

The :fields array lists the attributes of the Artist ActiveRecord object that are mapped to the artist fields in Solr’s schema.xml. Because acts_as_solr is designed to store data in Solr that is mastered in your data model, it needs a way of distinguishing among various types of data model objects. For example, if we wanted to store information about our User model object in Solr in addition to the Artist object, then we would need to provide a type_field to separate the Solr document for the artist with the primary key of 5 from the one for the user with the primary key of 5. Fortunately the mbartists schema has a field named type that stores the value Artist, which maps directly to our ActiveRecord class name of Artist, and we are able to use that instead of the default acts_as_solr type field in Solr named type_s.

There is a simple script called populate.rb at the root of /examples/8/myfaves that you can run that will copy the artist data from the existing Solr mbartists index into the MyFaves database:

>>ruby populate.rb

populate.rb is a great example of the types of scripts you may need to develop to transfer data into and out of Solr. Most scripts typically work with batches of records that are pulled from one system and then inserted into Solr. The larger the batch size, the more efficient the pulling and processing of data typically is, at the cost of more memory being consumed and slower commit and optimize operations. When you run the populate.rb script, play with the batch size parameter to get a sense of resource consumption in your environment. Try a batch size of 10 versus 10000 to see the changes. The parameters for populate.rb are available at the top of the script:

MBARTISTS_SOLR_URL = 'http://localhost:8983/solr/mbartists'
BATCH_SIZE = 1500
MAX_RECORDS = 100000 # the maximum number of records to load, or nil for all

There are roughly 399,000 artists in the mbartists index, so if you are impatient, then you can set MAX_RECORDS to a more reasonable number.

The process for connecting to Solr is very simple with a hash of parameters that are passed as part of the GET request. We use the magic query value of *:* to find all of the artists in the index and then iterate through the results using the start parameter:

connection = Solr::Connection.new(MBARTISTS_SOLR_URL)
solr_data = connection.send(Solr::Request::Standard.new({
  :query => '*:*',
  :rows => BATCH_SIZE,
  :start => offset,
  :field_list => ['*', 'score']
}))
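
The batching logic around a request like this can be sketched without a running Solr. This is only an illustration of paging with a start offset, not the code in populate.rb: FAKE_INDEX, fetch_batch, and each_document are hypothetical names, and fetch_batch stands in for the real Solr request.

```ruby
# Hypothetical stand-in for a Solr index holding 10 documents.
FAKE_INDEX = (1..10).map { |i| { 'id' => "Artist:#{i}" } }

# Stand-in for the Solr request: returns at most `rows` documents,
# beginning at offset `start`, or an empty array past the end.
def fetch_batch(start, rows)
  FAKE_INDEX[start, rows] || []
end

# Walks the result set one batch at a time. Stops when a batch comes
# back empty or when max_records (nil means no limit) has been reached.
# Returns the number of documents yielded.
def each_document(batch_size, max_records = nil)
  offset = 0
  loaded = 0
  loop do
    docs = fetch_batch(offset, batch_size)
    break if docs.empty?
    docs.each do |doc|
      return loaded if max_records && loaded >= max_records
      yield doc
      loaded += 1
    end
    offset += batch_size
  end
  loaded
end

count = each_document(4, 7) { |doc| puts doc['id'] }
puts "loaded #{count} documents"
```

A smaller batch size means more round trips for the same number of documents, which is the trade-off to watch when experimenting with BATCH_SIZE.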

In order to create our new Artist model objects, we just iterate through the results of solr_data. If solr_data is nil, then we exit out of the script knowing that we’ve run out of results. However, we do have to do some parsing in order to preserve our unique identifiers between Solr and the database. In our MusicBrainz Solr schema, the id field functions as the primary key and looks like Artist:11650 for The Smashing Pumpkins. In the database, in order to keep the two in sync, we need to insert the Artist with the ID of 11650. We wrap the insert statement a.save! in a begin/rescue/end structure so that if we’ve already inserted an artist with that primary key, then the script continues. This allows us to run the populate script multiple times:

solr_data.hits.each do |doc|
  # Strip the "Artist:" prefix (7 characters) to recover the numeric primary key.
  id = doc["id"][7..-1]
  a = Artist.new(:name => doc["a_name"], :group_type => doc["a_type"],
                 :release_date => doc["a_release_date_latest"])
  a.id = id
  begin
    a.save!
  rescue ActiveRecord::StatementInvalid => ar_si
    # Sink duplicate-key errors so the script can be re-run; re-raise anything else.
    raise ar_si unless ar_si.to_s.include?("PRIMARY KEY must be unique")
  end
end
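
Slicing at a fixed offset works only because every id in this index starts with Artist:. A slightly more defensive variant splits on the colon and checks the prefix. This is a sketch; solr_id_to_pk is a hypothetical helper and not part of populate.rb:

```ruby
# Converts a Solr document id such as "Artist:11650" into the integer
# primary key 11650, raising if the type prefix or the number is unexpected.
def solr_id_to_pk(solr_id, expected_type = 'Artist')
  type, pk = solr_id.split(':', 2)
  unless type == expected_type && pk.to_s.match?(/\A\d+\z/)
    raise ArgumentError, "unexpected Solr id: #{solr_id.inspect}"
  end
  pk.to_i
end

puts solr_id_to_pk('Artist:11650') # prints 11650
```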

Now that we’ve transferred the data out of our mbartists index and used acts_as_solr according to the various conventions that it expects, we’ll change from using the mbartists Solr instance to the version of Solr shipped with acts_as_solr.

Solr-related configuration information is available in ./myfaves/config/solr.yml. Ensure that the default development URL doesn’t conflict with any existing Solr instances you may be running:

development:
  url: http://127.0.0.1:8982/solr
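
If you want to check which URL a given environment resolves to, the setting can be read with Ruby's standard YAML library. This is a minimal sketch that parses an inline copy of the development entry above rather than the real file; SOLR_CONFIG and solr_url_for are hypothetical names:

```ruby
require 'yaml'

# Inline copy of the development entry shown above. A real script would
# read the text of the Solr configuration file instead.
SOLR_CONFIG = <<~YAML
  development:
    url: http://127.0.0.1:8982/solr
YAML

# Returns the Solr URL configured for the given environment name.
# Raises KeyError if the environment or its url entry is missing.
def solr_url_for(environment, config_text = SOLR_CONFIG)
  config = YAML.safe_load(config_text)
  config.fetch(environment).fetch('url')
end

puts solr_url_for('development')
```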

Start the included Solr by running rake solr:start. When it starts up, it will report the process ID for Solr running in the background. If you need to stop the process, then run the corresponding rake task: rake solr:stop. The empty new Solr indexes are stored in ./myfaves/solr/development.

Build Solr indexes from relational database

Now we are ready to trigger a full index of the data in the relational database into Solr. acts_as_solr provides a very convenient rake task for this with a variety of parameters that you can learn about by running rake -D solr:reindex. We’ll specify to work with a batch size of 1500 artists at a time:

>>rake solr:start
>>rake solr:reindex BATCH=1500
(in /examples/8/myfaves)
Clearing index for Artist...
Rebuilding index for Artist...
Optimizing...

This drastic simplification of configuration in the Artist model object is because we are using a Solr schema that is designed to leverage the Convention over Configuration ideas of Rails. Some of the conventions that are established by acts_as_solr and met by Solr are:

  • Primary key field for model object in Solr is always called pk_i.
  • Type field that stores the disambiguating class name of the model object is called type_s.
  • Heavy use of the dynamic field support in Solr. The data type of ActiveRecord model objects is based on the database column type. Therefore, when acts_as_solr indexes a model object, it sends a document to Solr with field names carrying the various suffixes that trigger dynamic field creation. In /examples/8/myfaves/vendor/plugins/acts_as_solr/solr/solr/conf/schema.xml, the only fields defined outside of the management fields are dynamic fields:

    <dynamicField name="*_t" type="text" indexed="true"
    stored="false"/>

  • The default search field is called text, and all of the fields ending in _t are copied into it.
  • Fields to facet on are given the suffix _facet and are copied into the text search field as well.

The document that gets sent to Solr for our Artist records creates the dynamic fields name_t, group_type_s and release_date_d, for a text, string, and date field respectively. You can see the list of dynamic fields generated through the schema browser at http://localhost:8982/solr/admin/schema.jsp.
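
The naming convention can be illustrated with a small lookup table. This sketch only mirrors the suffixes mentioned in this section (_t, _s, _d, and the _i of pk_i). It is not the plugin's actual implementation, and SUFFIX_FOR_TYPE and dynamic_field_name are hypothetical names:

```ruby
# Suffixes taken from the conventions described above. Illustrative only.
SUFFIX_FOR_TYPE = { text: 't', string: 's', date: 'd', integer: 'i' }.freeze

# Builds the dynamic field name for an attribute of the given type.
def dynamic_field_name(attribute, type)
  suffix = SUFFIX_FOR_TYPE.fetch(type) do
    raise ArgumentError, "no suffix known for type #{type.inspect}"
  end
  "#{attribute}_#{suffix}"
end

fields = { name: :text, group_type: :string, release_date: :date }
puts fields.map { |attribute, type| dynamic_field_name(attribute, type) }.join(', ')
# prints: name_t, group_type_s, release_date_d
```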

Now we are ready to perform some searches. acts_as_solr adds some new methods such as find_by_solr() that lets us find ActiveRecord model objects by sending a query to Solr. Here we find the group Smash Mouth by searching for matches to the word smashing:

% ./script/console
Loading development environment (Rails 2.3.2)
>> artists = Artist.find_by_solr("smashing")
=> #<ActsAsSolr::SearchResults:0x224889c @solr_data={:total=>9,
:docs=>[#<Artist id: 364, name: "Smash Mouth"...
>> artists.docs.first
=> #<Artist id: 364, name: "Smash Mouth", group_type: 1,
release_date: "2006-09-19 04:00:00", created_at: "2009-04-17
18:02:37", updated_at: "2009-04-17 18:02:37">

Let’s also verify that acts_as_solr is managing the full lifecycle of our objects. Assuming Susan Boyle isn’t yet entered as an artist, let’s go ahead and create her:

>> Artist.find_by_solr("Susan Boyle")
=> #<ActsAsSolr::SearchResults:0x26ee298 @solr_data={:total=>0,
:docs=>[]}>
>> susan = Artist.create(:name => "Susan Boyle", :group_type => 1,
:release_date => Date.new)
=> #<Artist id: 548200, name: "Susan Boyle", group_type: 1,
release_date: "-4712-01-01 05:00:00", created_at: "2009-04-21
13:11:09", updated_at: "2009-04-21 13:11:09">

Check the log output from your Solr running on port 8982, and you should see an update request triggered by the insert of the new Susan Boyle record:

INFO: [] webapp=/solr path=/update params={} status=0 QTime=24

Now, if we delete Susan’s record from our database:

>> susan.destroy
=> #<Artist id: 548200, name: "Susan Boyle", group_type: 1,
release_date: "-4712-01-01 05:00:00", created_at: "2009-04-21
13:11:09", updated_at: "2009-04-21 13:11:09">

Then there should be another corresponding update issued to Solr to remove the document:

INFO: [] webapp=/solr path=/update params={} status=0 QTime=57
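
The pattern behind these two log lines, where every save and destroy on the model is mirrored by a call to the index, can be sketched in plain Ruby. This is a conceptual illustration with an in-memory stand-in for Solr; FakeIndex and IndexedArtist are hypothetical classes and do not reflect how acts_as_solr is implemented internally:

```ruby
# Hypothetical in-memory stand-in for a Solr index, keyed by "Type:id".
class FakeIndex
  attr_reader :docs

  def initialize
    @docs = {}
  end

  def add(key, fields)
    @docs[key] = fields
  end

  def delete(key)
    @docs.delete(key)
  end
end

# Hypothetical model whose save and destroy also update the index.
class IndexedArtist
  attr_reader :id, :name

  def initialize(index, id, name)
    @index = index
    @id = id
    @name = name
  end

  def save
    # The database insert or update would happen here.
    @index.add(solr_key, { 'name' => @name })
    self
  end

  def destroy
    # The database delete would happen here.
    @index.delete(solr_key)
    self
  end

  private

  # Combines class name and primary key so different models cannot collide.
  def solr_key
    "#{self.class.name}:#{@id}"
  end
end

index = FakeIndex.new
susan = IndexedArtist.new(index, 548200, 'Susan Boyle').save
puts index.docs.keys.inspect
susan.destroy
puts index.docs.keys.inspect
```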

You can verify this by doing a search for Susan Boyle directly, which should return no rows at http://localhost:8982/solr/select/?q=Susan+Boyle.
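
To run that check from a script rather than a browser, the query string can be built with Ruby's standard URI library, which takes care of the encoding. This sketch only constructs the URL; solr_select_url is a hypothetical helper, and fetching the result would additionally need an HTTP client and a running Solr:

```ruby
require 'uri'

# Builds a select URL for a Solr instance from a base URL, a query
# string, and optional extra request parameters.
def solr_select_url(base_url, query, extra_params = {})
  params = { 'q' => query }.merge(extra_params)
  "#{base_url}/select?#{URI.encode_www_form(params)}"
end

puts solr_select_url('http://localhost:8982/solr', 'Susan Boyle', { 'wt' => 'ruby' })
# prints: http://localhost:8982/solr/select?q=Susan+Boyle&wt=ruby
```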
