
This article by Andrea Gazzarini, author of the book Apache Solr Essentials, describes the various ways in which you can deploy Solr, including the key features, pros, and cons of each scenario.

Solr has a wide range of deployment alternatives, from monolithic to distributed indexes and standalone to clustered instances. We will organize this article by deployment scenarios, with a growing level of complexity.

This article will cover the following topics:

  • Sharding
  • Replication: master, slave, and repeaters


Standalone instance

All the examples use a standalone instance of Solr, that is, one or more cores managed by a single Solr deployment hosted in a standalone servlet container (for example, Jetty or Tomcat).

This kind of deployment is useful for development because, as you learned, it is very easy to start and debug. Besides, it can also be suitable for a production context if you don’t have strict non-functional requirements and have a small or medium amount of data.

I have used a standalone instance to provide autocomplete services for small and medium intranet systems.

In any case, the main features of this kind of deployment are simplicity and maintainability: a single node acts as both an indexer and a searcher. The following diagram depicts a standalone instance with two cores:

Shards

When a monolithic index becomes too large for a single node or when additions, deletions, or queries take too long to execute, the index can be split into multiple pieces called shards.

The previous sentence highlights a logical and theoretical evolution path of a Solr index. However, this (in general) is valid for all scenarios we will describe. It is strongly recommended that you perform a preliminary analysis of your data and the estimated growth factor in order to decide from the beginning the right configuration that suits your requirements. Although it is possible to split an existing index into shards (https://lucene.apache.org/core/4_10_3/misc/org/apache/lucene/index/PKIndexSplitter.html), things definitely become easier if you start directly with a distributed index (if you need it, of course).

The index is split so that each shard contains a disjoint subset of the entire index. Solr will query and merge results across those shards. The following diagram illustrates a Solr deployment with 3 nodes; this deployment consists of two cores (C1 and C2) divided into three shards (S1, S2, and S3):

When using shards, only query requests are distributed. This means that it’s up to the indexer to add and distribute the data across nodes, and to subsequently forward a change request (that is, delete, replace, and commit) for a given document to the appropriate shard (the shard that owns the document).

The Solr Wiki recommends a simple, hash-based algorithm to determine the shard where a given document should be indexed:

documentId.hashCode() % numServers

This approach is also useful because it tells you in advance where to send delete or update requests for a given document.
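The following is a minimal Java sketch of that routing rule. The shard URLs and the helper class are purely illustrative; they are not part of Solr:

import java.util.Arrays;
import java.util.List;

public class ShardRouter {

   // Base URLs of the shards, in a fixed order (hypothetical values).
   private static final List<String> SHARDS = Arrays.asList(
      "localhost:8080/solr/c1",
      "localhost:8081/solr/c2");

   // Returns the shard that owns the given document identifier.
   public static String shardFor(String documentId) {
      // Math.abs guards against negative hash codes.
      return SHARDS.get(Math.abs(documentId.hashCode() % SHARDS.size()));
   }

   public static void main(String[] args) {
      // The same identifier always maps to the same shard,
      // so deletes and updates can be routed consistently.
      System.out.println(shardFor("9920"));
   }
}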

On the query side, a searcher client can send a query request to any node, but it has to specify an additional shards parameter that declares the target shards to be queried. In the following example, assuming that two shards are hosted in two servers listening on ports 8080 and 8081, the same request sent to either node will produce the same result:

http://localhost:8080/solr/c1/query?q=*:*&shards=localhost:8080/solr/c1,localhost:8081/solr/c2
http://localhost:8081/solr/c2/query?q=*:*&shards=localhost:8080/solr/c1,localhost:8081/solr/c2
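The same distributed query can, of course, be issued programmatically. Here is a minimal sketch using SolrJ (assuming SolrJ 4.x, where HttpSolrServer is the client class; hosts, ports, and core names are the same hypothetical ones used above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedQueryExample {
   public static void main(String[] args) throws SolrServerException {
      // Any node can receive the request...
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/c1");

      SolrQuery query = new SolrQuery("*:*");
      // ...as long as the target shards are declared.
      query.set("shards", "localhost:8080/solr/c1,localhost:8081/solr/c2");

      QueryResponse response = solr.query(query);
      System.out.println("Found " + response.getResults().getNumFound() + " documents");
   }
}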

When sending a query request, a client can optionally include in the fl parameter a pseudofield associated with the [shard] document transformer. In this case, each returned document will carry additional information indicating the owning shard. This is an example of such a request:

http://localhost:8080/solr/c1/query?q=*:*&shards=localhost:8080/solr/c1,localhost:8081/solr/c2&fl=*,src_shard:[shard]

Here is the corresponding response (note the pseudofield aliased as src_shard):

<result name="response" numFound="192" start="0">
   <doc>
      <str name="id">9920</str>
      <str name="brand">Fender</str>
      <str name="model">Jazz Bass</str>
      <arr name="artist">
         <str>Marcus Miller</str>
      </arr>
      <str name="series">Marcus Miller signature</str>
      <str name="src_shard">localhost:8080/solr/c1</str>
   </doc>
   …
   <doc>
      <str name="id">4392</str>
      <str name="brand">Music Man</str>
      <str name="model">Sting Ray</str>
      <arr name="artist">
         <str>Tony Levin</str>
      </arr>
      <str name="series">5 strings DeLuxe</str>
      <str name="src_shard">localhost:8081/solr/c2</str>
   </doc>
</result>

The following are a few things to keep in mind when using this deployment scenario:

  • The schema must have a uniqueKey field. This field must be declared as stored and indexed, and its values must be unique across all shards (see the schema.xml snippet after this list).
  • Inverse Document Frequency (IDF) calculations are not distributed; IDF is computed per shard.
  • Joins between documents belonging to different shards are not supported.
  • If a shard receives both index and query requests, the index may change during a query execution, thus compromising the outgoing results (for example, a matching document might have been deleted in the meantime).
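As a reference for the first point, a minimal schema.xml fragment could look like the following (the field name id and the string type are just common conventions, not requirements):

<!-- The uniqueKey field must be both indexed and stored -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>

<uniqueKey>id</uniqueKey>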

Master/slaves scenario

In a master/slaves scenario, there are two types of Solr servers: an indexer (the master) and one or more searchers (the slaves).

The master is the server that manages the index. It receives update requests and applies those changes. A searcher, on the other hand, is a Solr server that exposes search services to external clients.

The index, in terms of data files, is replicated from the indexer to the searcher through HTTP by means of a built-in RequestHandler that must be configured on both the indexer side and searcher side (within the solrconfig.xml configuration file).

On the indexer (master), a replication configuration looks like this:

<requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="master">
      <str name="replicateAfter">startup</str>
      <str name="replicateAfter">optimize</str>
      <str name="confFiles">schema.xml,stopwords.txt</str>
   </lst>
</requestHandler>

The replication mechanism can be configured to be triggered after one of the following events:

  • Commit: A commit has been applied
  • Optimize: The index has been optimized
  • Startup: The Solr instance has started

In the preceding example, we want the index to be replicated after startup and optimize commands. Using the confFiles parameter, we can also indicate a set of configuration files (schema.xml and stopwords.txt, in the example) that must be replicated together with the index.

Remember that changes to those files alone don’t trigger any replication. Only a change in the index, in conjunction with one of the events we defined in the replicateAfter parameter, will mark the index (and the configuration files) as replicable.

On the searcher side, the configuration looks like the following:

<requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="slave">
      <str name="masterUrl">http://<master host>:<port>/solrmaster</str>
      <str name="pollInterval">00:00:10</str>
   </lst>
</requestHandler>

A searcher periodically polls the master (at the interval defined by the pollInterval parameter) to check whether a newer version of the index is available. If it is, the searcher starts the replication mechanism by issuing a request to the master, which is completely unaware of its searchers.

The replicability status of the index is actually indicated by a version number. If the searcher has the same version as the master, it means the index is the same. If the versions are different, it means that a newer version of the index is available on the master, and replication can start.
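The /replication handler used for this conversation can also be queried manually. Among other commands, indexversion returns the version and generation of the latest replicable index on the master, while details returns a report about the replication status of a node (host, port, and core name below are placeholders):

http://<master host>:<port>/solr/<core name>/replication?command=indexversion
http://<host>:<port>/solr/<core name>/replication?command=details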

Besides separating responsibilities, this deployment configuration allows us to have a so-called diamond architecture, consisting of one indexer and several searchers. When the replication is triggered, each searcher will receive a whole copy of the index. This allows the following:

  • Load balancing of the incoming (query) requests.
  • Increased availability of the whole system: in the event of a server crash, the other searchers will continue to serve the incoming requests.

The following diagram illustrates a master/slave deployment scenario with one indexer, three searchers, and two cores:

If the searchers are located in several geographically distributed data centers, an additional role called repeater can be configured in each data center in order to optimize the replication traffic between data centers. A repeater is simply a node that acts as both a master and a slave: it is a slave of the main master and, at the same time, a master for the searchers within its own data center, as shown in this diagram:
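In terms of configuration, a repeater simply declares both roles in its ReplicationHandler. The following is a minimal sketch; the masterUrl value is a placeholder pointing to the replication handler of the main master:

<requestHandler name="/replication" class="solr.ReplicationHandler">
   <!-- Slave section: pull the index from the main master -->
   <lst name="slave">
      <str name="masterUrl">http://<main master host>:<port>/solr/<core name>/replication</str>
      <str name="pollInterval">00:00:10</str>
   </lst>
   <!-- Master section: act as the replication source for the local searchers -->
   <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="replicateAfter">startup</str>
      <str name="confFiles">schema.xml,stopwords.txt</str>
   </lst>
</requestHandler>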

Shards with replication

This scenario combines shards and replication in order to have a scalable system with high throughput and availability. There is one indexer and one or more searchers for each shard, allowing load balancing between (query) shard requests. The following diagram illustrates a scenario with two cores, three shards, one indexer and (for reasons of space in the diagram) only one searcher for each shard:

The drawback of this approach is undoubtedly the growing overall complexity of the system, which requires more effort in terms of maintainability, manageability, and system administration. In addition, each searcher is an independent node, and we don’t have a central administration console where a system administrator can get a quick overview of system health.

Summary

In this article, we described the various ways in which you can deploy Solr. Each deployment scenario has specific features, advantages, and drawbacks that make it ideal for one context and a poor fit for another. The good thing is that the different scenarios are not strictly exclusive; they follow an incremental approach. In an ideal world, you would start immediately with the scenario that perfectly fits your needs. However, unless your requirements are clear right from the start, you can begin with a simple configuration and then change it, depending on how your application evolves.
