
Understanding data processing workflows

Depending on the data, the configuration, and the requirements, data can be processed at multiple levels while it is being prepared for search. Cascading and LucidWorks Big Data are two such application platforms with which a complex data processing workflow can be developed rapidly on the Hadoop framework. In Cascading, data is processed in phases, with each phase containing a pipe responsible for carrying data units and applying a filter to them. The following diagram shows how incoming data can be processed in a pipeline-based workflow:

Once data has passed through the workflow, it can be persisted at the end in a repository and later synced with the various nodes running in a distributed environment. The pipelining technique offers the following advantages (a minimal Cascading sketch follows the list):

  • The Apache Solr engine has minimal work to do during index creation
  • Incremental indexing can be supported
  • By introducing an intermediate store, you can take regular data backups at the required stages
  • Data can be transferred directly to different types of storage, such as HDFS, through multiple processing units
  • Data from different sources can be merged, joined, and processed as per the needs
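To make the pipe-based workflow concrete, here is a minimal Cascading sketch, assuming the Cascading 2.x Hadoop APIs; the HDFS paths and field names are illustrative, not taken from any particular project. It reads raw text lines from HDFS, filters out empty lines through a pipe, and writes the cleaned output back to HDFS, from where an indexing job could pick it up:

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CleaningFlow {
    public static void main(String[] args) {
        // Source and sink taps over HDFS; the paths are placeholders.
        Tap source = new Hfs(new TextLine(new Fields("line")), "hdfs:///data/raw-docs");
        Tap sink = new Hfs(new TextLine(), "hdfs:///data/cleaned-docs", SinkMode.REPLACE);

        // A pipe with a single filter that keeps only non-empty lines.
        Pipe clean = new Pipe("clean");
        clean = new Each(clean, new Fields("line"), new RegexFilter(".+"));

        FlowDef flowDef = FlowDef.flowDef()
            .addSource(clean, source)
            .addTailSink(clean, sink);

        // Run the flow on Hadoop and wait for completion.
        new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
    }
}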

LucidWorks Big Data is a more powerful product that helps users generate bulk indexes on Hadoop, classify and analyze the data, and provide distributed searching capabilities.

Sharding is the process of breaking one index into multiple logical units, called shards, which are distributed across multiple nodes. In the case of Solr, a query is run against all the shards, and the results are aggregated and returned.
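As a hedged illustration of how sharded results are aggregated, a distributed query in pre-SolrCloud Solr can be issued by passing the shards parameter, which lists the cores holding the individual shards; the node that receives the query fans it out and merges the results. The host names below are placeholders, and the example assumes the SolrJ 4.x HttpSolrServer client:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQuery {
    public static void main(String[] args) throws SolrServerException {
        // Any shard can receive the query; it aggregates results from the others.
        HttpSolrServer solr = new HttpSolrServer("http://shard1:8983/solr");

        SolrQuery query = new SolrQuery("title:hadoop");
        // Comma-separated list of shards to search across (placeholder hosts).
        query.set("shards", "shard1:8983/solr,shard2:8983/solr");

        QueryResponse response = solr.query(query);
        System.out.println("Total hits: " + response.getResults().getNumFound());
    }
}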

Big Data technologies can be used with Apache Solr for various operations. Index creation itself can be run on a distributed system in order to speed up the overall index generation activity. Once that is done, the index can be distributed across the different nodes participating in the Big Data cluster, and Solr can be run in a distributed manner for searching the data. You can set up your Solr instance in the following configurations:

Standalone machine

This configuration uses a single high-end server containing the indexes and Solr search; it is suitable for development and, in some cases, production.

Distributed setup

A distributed setup is suitable for large-scale indexes where the index is difficult to store on one system. In this case, the index has to be distributed across multiple machines. Although the distributed configuration of Solr offers ample flexibility in terms of processing, it has its own limitations: several Apache Solr features, such as MoreLikeThis and Joins, are not supported. The following diagram depicts the distributed setup:

Replicated mode

In this mode, more than one Solr instance exists; among them, the master instance provides shared access to its slaves for replicating the indexes across multiple systems. The master continues to participate in index creation, searching, and so on. The slaves sync up their storage through various replication techniques, such as the rsync utility. By default, Solr includes Java-based replication that uses the HTTP protocol for communication; this replication is recommended because of its benefits over external replication techniques. With the release of the Solr 4.x versions, this mode is no longer used on its own.

Sharded mode

This mode combines the best of both worlds and brings in the real value of a distributed system with high availability. In this configuration, the system has multiple masters, and each master has multiple slaves to which its indexes are replicated. A load balancer is used to distribute the load equally across the nodes.

The following diagram depicts the distributed and replicated setup:

If Apache Solr is deployed on a Hadoop-like framework, it falls into this category. Solr also provides SolrCloud for distributed Solr. We are going to look at different approaches in the next section.
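As a brief, hedged illustration of the SolrCloud route, SolrJ 4.x provides CloudSolrServer, which reads the cluster state from ZooKeeper and routes documents and queries to the appropriate shards. The ZooKeeper address, collection name, and field values below are placeholders:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexing {
    public static void main(String[] args) throws Exception {
        // Connect through ZooKeeper (placeholder address); SolrCloud handles routing.
        CloudSolrServer cloud = new CloudSolrServer("zkhost1:2181");
        cloud.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Scaling search with Hadoop and Solr");

        cloud.add(doc);
        cloud.commit();
        cloud.shutdown();
    }
}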

Using Solr 1045 patch – map-side indexing

Work on the Solr-1045 patch started with the goal of achieving index generation/building using Apache MapReduce tasks. The Solr-1045 patch converts all the input records to a set of <key, value> pairs in each map task that runs on Hadoop. It then creates a SolrInputDocument from each <key, value> pair and, later, the Solr indexes. The following diagram depicts this process:

Reduce tasks can be used to perform deduplication of indexes and merge them together if required. Although index merging seems an interesting feature, it is actually a costly affair in terms of processing, and you will not find many implementations that use the merge functionality. Once the indexes are created, you can load them on your Solr instance and use them for searching.
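To illustrate the map-side idea, here is a hedged sketch of a mapper that builds a SolrInputDocument from each input record. It is not the actual SolrIndexUpdateMapper from the patch: the patch writes the index from within the map task, whereas this sketch only constructs the document and emits its fields so that it stays runnable without the patch's index writers. The tab-separated input layout is an assumption.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.solr.common.SolrInputDocument;

// Illustrative only: the real SOLR-1045 mapper writes Lucene/Solr indexes
// directly from the map task; this sketch just shows the document construction.
public class MapSideDocMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assume tab-separated input: id<TAB>title (an assumption for this sketch).
        String[] parts = line.toString().split("\t", 2);
        if (parts.length < 2) {
            return; // skip malformed records
        }

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", parts[0]);
        doc.addField("title", parts[1]);

        // The patch would hand 'doc' to an index updater at this point; we simply
        // emit the fields so the job remains runnable without Solr index writers.
        context.write(new Text(parts[0]), new Text(parts[1]));
        context.getCounter("solr", "documents-built").increment(1);
    }
}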

You can download this particular patch from https://issues.apache.org/jira/browse/SOLR-1045 and patch your Solr instance. To apply the patch, you first need to build your Solr instance from source. Before applying the patch, do a dry run, which does not actually apply the patch. You can do this with the following commands:

cd <solr-trunk-dir>
svn patch <name-of-patch> --dry-run

If the dry run is successful, you can run the same command without the --dry-run option to apply the patch. Let's look at some of the important classes in the patch.

  • SolrIndexUpdateMapper: This class is a Hadoop mapper responsible for creating indexes out of the <key, value> pairs of input.
  • SolrXMLDocRecordReader: This class is responsible for reading Solr input XML files.
  • SolrIndexUpdater: This class creates a MapReduce job configuration, runs the job to read the documents, and updates the Solr instance. Right now it is built using the Lucene index updater.

Benefits and drawbacks

The following are the benefits and drawbacks of using the Solr-1045 patch:

Benefits

  • It achieves complete parallelism by creating the index right in the map task.
  • Merging of indexes is possible in the reduce phase of MapReduce.

Drawbacks

  • When indexing is done on the map side, all the <key, value> pairs received by the reducer have equal weight/importance. So, it is difficult to use this patch with data that carries ranking/weight information.

Using Solr 1301 patch – reduce-side indexing

This patch focuses on using the Apache MapReduce framework for index creation. Keyword search can then happen over Apache Solr or Apache SolrCloud. Unlike Solr-1045, in this patch the indexes are created in the reduce phase of MapReduce: a map task is responsible for converting the input records to <key, value> pairs, which are then passed to the reducer; the reducer converts them into SolrInputDocument objects and creates the indexes out of them. These indexes are then produced as the output of the Hadoop MapReduce process. The following diagram depicts this process:

To use the Solr-1301 patch, you need to set up a Hadoop cluster. Once the index is created through the Hadoop patch, it should then be provisioned to the Solr server. The patch contains a default converter for CSV files. Let's look at some of the important classes that are part of this patch.

  • CSVDocumentConverter: This class is responsible for converting the output of the map task, that is, the key-value pairs, to SolrInputDocument; you can have multiple document converters.
  • CSVReducer: This is the reducer code implemented for Hadoop reducers.
  • CSVIndexer: This is the main class to be called from your command line for creating indexes using MapReduce. You need to provide the input path for your data and the output path for storing the shards.
  • SolrDocumentConverter: This class is used in your map task for converting your objects into Solr documents.
  • SolrRecordWriter: This class is an extension of mapreduce.RecordWriter; it breaks the data into multiple (key, value) pairs, which are then converted into a collection of SolrInputDocument(s), and this data is submitted to the embedded Solr server in batches. Once completed, it commits the changes and runs the optimizer on the embedded server.
  • CSVMapper: This is a mapper class that parses a CSV file and gets key-value pairs out of it.
  • SolrOutputFormat: This class is responsible for converting the key-value pairs and writing the data to the file/HDFS in ZIP/raw format.
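Before walking through the steps, here is a hedged sketch of how reduce-side indexing looks in principle; it is not the patch's own CSVReducer/SolrRecordWriter code. The reducer builds SolrInputDocument objects and submits them in a batch to an embedded Solr server. The Solr home path, core name, and field names are assumptions, and the CoreContainer setup follows the Solr 4.x style, which can differ across versions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

// Illustrative only: the real SOLR-1301 reducer batches documents and writes
// shards to HDFS; this sketch indexes into a local embedded core instead.
public class ReduceSideIndexer extends Reducer<Text, Text, Text, Text> {

    private EmbeddedSolrServer solr;
    private final List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

    @Override
    protected void setup(Context context) {
        // Assumes a Solr 4.x-style core container; the Solr home path is a placeholder.
        CoreContainer container = new CoreContainer("/local/solr-home");
        container.load();
        solr = new EmbeddedSolrServer(container, "collection1");
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // One document per key; "id" and "text" are assumed schema fields.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", key.toString());
        for (Text value : values) {
            doc.addField("text", value.toString()); // assumed multi-valued field
        }
        batch.add(doc);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            if (!batch.isEmpty()) {
                solr.add(batch);   // submit the documents in one batch
            }
            solr.commit();         // make them searchable
        } catch (Exception e) {
            throw new IOException("Failed to index batch", e);
        } finally {
            solr.shutdown();
        }
    }
}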

Perform the following steps to run this patch:

  1. Create a local folder with a conf folder containing the Solr configuration (solr-config.xml, schema.xml) and a lib folder containing the required libraries.
  2. Create your own converter class implementing SolrDocumentConverter; this will be used by SolrOutputFormat to convert output records to Solr documents. You may also override the OutputFormat class provided by Solr. (A hedged sketch of such conversion logic follows this list.)
  3. Configure the Hadoop MapReduce job as follows:

    SolrOutputFormat.setupSolrHomeCache(new File(solrConfigDir), conf);
    conf.setOutputFormat(SolrOutputFormat.class);
    SolrDocumentConverter.setSolrDocumentConverter(<your classname>.class, conf);

  4. Zip your configuration, and load it in HDFS. The ZIP file name should be solr.zip (unless you change the patch code).
  5. Now run the patch; each of the jobs will instantiate an EmbeddedSolrInstance, which will in turn do the conversion, and finally the SolrInputDocument(s) are stored in the output format.
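As a rough illustration of step 2, the following standalone helper shows the kind of conversion logic you would place inside your SolrDocumentConverter implementation. It deliberately does not extend the patch's base class, because the exact signature expected by the patch may differ across versions; the CSV column layout (id, title, body) is an assumption for this sketch.

import org.apache.solr.common.SolrInputDocument;

// A hedged, standalone sketch of converter logic for step 2; plug the
// equivalent of convert() into your SolrDocumentConverter subclass.
public class CsvLineToSolrDoc {

    public static SolrInputDocument convert(String csvLine) {
        // Assumed column order: id, title, body (no quoted commas handled here).
        String[] columns = csvLine.split(",", 3);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", columns[0].trim());
        if (columns.length > 1) {
            doc.addField("title", columns[1].trim());
        }
        if (columns.length > 2) {
            doc.addField("body", columns[2].trim());
        }
        return doc;
    }
}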

Benefits and drawbacks

The following are the benefits and drawbacks of using Solr-1301 patch:

Benefits

  • With reduce-side index generation, it is possible to preserve the weights of documents, which can contribute to prioritization during a search query.

Drawbacks

  • Merging of indexes is not possible as it is in Solr-1045, because the indexes are created in the reduce phase.
  • The reducer becomes the crucial component of the system, since the major tasks are performed there.

Summary

In this article, we have covered the different possible approaches for making Big Data work with Apache Hadoop and Solr. We also looked at the benefits and drawbacks of these approaches.
