

General tips

Before diving into some advanced strategies for improving performance and scalability, let’s briefly recap some of the general performance tips already spread across the book:

  • When mapping your entity classes for Hibernate Search, use the optional elements of the @Field annotation to strip unnecessary bloat from your Lucene indexes (a brief mapping sketch follows this list):

    • If you are definitely not using index-time boosting, then there is no reason to store the information needed to make this possible. Set the norms element to Norms.NO.

    • By default, the information needed for a projection-based query is not stored unless you set the store element to Store.YES or Store.COMPRESS. If you have projection-based queries that are no longer used, then remove this element as part of your cleanup.

  • Use conditional indexing and partial indexing to reduce the size of Lucene indexes.

  • Rely on filters to narrow your results at the Lucene level, rather than using a WHERE clause at the database query level.

  • Experiment with projection-based queries wherever possible to reduce or eliminate the need for database calls. Be aware that with advanced database caching, the benefits might not always justify the added complexity.

  • Test various index manager options, such as trying the near-real-time index manager or the async worker execution mode.
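
To illustrate the first tip, here is a minimal mapping sketch using the Hibernate Search annotations; the entity and field names are hypothetical, and which fields deserve which elements depends entirely on how your queries use them:

import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.Norms;
import org.hibernate.search.annotations.Store;

@Entity
@Indexed
public class Device {

    @Id
    private Long id;

    // No index-time boosting is used for this field, so skip storing norms.
    @Field(norms = Norms.NO)
    private String manufacturer;

    // Stored only because a projection-based query reads it straight from the
    // Lucene index; remove the store element if that query is ever retired.
    @Field(store = Store.COMPRESS)
    private String description;

}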

Running applications in a cluster

Making modern Java applications scale in a production environment usually involves running them in a cluster of server instances. Hibernate Search is perfectly at home in a clustered environment, and offers multiple approaches for configuring a solution.

Simple clusters

The most straightforward approach requires very little Hibernate Search configuration. Just set up a file server for hosting your Lucene indexes, and make it available to every server instance in your cluster (over NFS, Samba, and so on):

A simple cluster with multiple server nodes using a common Lucene index on a shared drive

Each application instance in the cluster uses the default index manager, and the usual filesystem directory provider.

In this arrangement, all of the server nodes are true peers. They each read from the same Lucene index, and no matter which node performs an update, that node is responsible for the write. To prevent index corruption, Hibernate Search relies on the locking strategy (that is, either “simple” or “native”) to block simultaneous writes.
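
As a rough sketch of what each node's configuration could look like, the following sets the relevant Hibernate Search properties programmatically through JPA. The shared mount point and persistence unit name are hypothetical, and the same keys can just as easily go into hibernate.cfg.xml or persistence.xml:

import java.util.Properties;

import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class SimpleClusterBootstrap {

    public static EntityManagerFactory buildEntityManagerFactory() {
        Properties props = new Properties();

        // Every node points the default filesystem directory provider at the
        // same shared location (for example, an NFS mount).
        props.setProperty("hibernate.search.default.directory_provider", "filesystem");
        props.setProperty("hibernate.search.default.indexBase", "/mnt/shared/lucene-indexes");

        // The "simple" or "native" locking strategy blocks simultaneous writes
        // to the shared index.
        props.setProperty("hibernate.search.default.locking_strategy", "native");

        return Persistence.createEntityManagerFactory("my-persistence-unit", props);
    }

}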

Recall that the “near-real-time” index manager is explicitly incompatible with a clustered environment.

The advantage of this approach is twofold. First and foremost is simplicity: the only steps involved are setting up a filesystem share and pointing each application instance's directory provider to the same location. Second, this approach ensures that Lucene updates are instantly visible to all the nodes in the cluster.

However, a serious downside is that this approach can only scale so far. Very small clusters may work fine, but larger numbers of nodes trying to simultaneously access the same shared files will eventually lead to lock contention.

Also, the file server on which the Lucene indexes are hosted is a single point of failure. If the file share goes down, then your search functionality breaks catastrophically and instantly across the entire cluster.

Master-slave clusters

When your scalability needs outgrow the limitations of a simple cluster, Hibernate Search offers more advanced models to consider. The common element among them is the idea of a master node being responsible for all Lucene write operations.

Clusters may also include any number of slave nodes. Slave nodes may still initiate Lucene updates, and the application code can't really tell the difference. Under the covers, however, slave nodes delegate that work to the master node, which actually performs it.

Directory providers

In a master-slave cluster, there is still an “overall master” Lucene index, which logically stands apart from all of the nodes. This may be filesystem-based, just as it is with a simple cluster. However, it may instead be based on JBoss Infinispan (http://www.jboss.org/infinispan), an open source in-memory NoSQL datastore sponsored by the same company that principally sponsors Hibernate development:

  • In a filesystem-based approach, all nodes keep their own local copies of the Lucene indexes. The master node actually performs updates on the overall master indexes, and all of the nodes periodically read from that overall master to refresh their local copies (a configuration sketch follows this list).

  • In an Infinispan-based approach, the nodes all read from the Infinispan index (although it is still recommended to delegate writes to a master node). Therefore, the nodes do not need to maintain their own local index copies. In reality, because Infinispan is a distributed datastore, portions of the index will reside on each node anyway. However, it is still best to visualize the overall index as a separate entity.
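
For the filesystem-based approach, a rough configuration sketch might look like the following; the paths and the refresh period (in seconds) are illustrative, and each node would supply its own values:

import java.util.Properties;

public class MasterSlaveIndexConfig {

    // Builds the Hibernate Search directory provider properties for one node
    // in a filesystem-based master-slave cluster.
    public static Properties forNode(boolean isMaster) {
        Properties props = new Properties();

        // The master copies its local index out to the overall master location;
        // slaves periodically copy the overall master down to their local index.
        props.setProperty("hibernate.search.default.directory_provider",
                isMaster ? "filesystem-master" : "filesystem-slave");

        // Local working copy of the index on this node.
        props.setProperty("hibernate.search.default.indexBase", "/var/lucene/local-indexes");

        // Shared location of the overall master index (for example, an NFS mount).
        props.setProperty("hibernate.search.default.sourceBase", "/mnt/shared/overall-master");

        // How often, in seconds, to synchronize with the overall master.
        props.setProperty("hibernate.search.default.refresh", "300");

        return props;
    }

}

An Infinispan-based setup drops the local and source paths and sets the directory_provider value to infinispan instead (by way of the hibernate-search-infinispan module), with the cache itself handling distribution of the index across the nodes.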

Worker backends

There are two available mechanisms by which slave nodes delegate write operations to the master node:

  • A JMS message queue provider creates a queue, and slave nodes send messages to this queue with details about Lucene update requests. The master node monitors this queue, retrieves the messages, and actually performs the update operations (a configuration sketch appears at the end of this section).

  • You may instead replace JMS with JGroups (http://www.jgroups.org), an open source multicast communication system for Java applications. This has the advantage of being faster and more immediate: messages are received in real time, synchronously rather than asynchronously.

    However, JMS messages are generally persisted to disk while awaiting retrieval, and can therefore be recovered and processed later in the event of an application crash. If you are using JGroups and the master node goes offline, then all of the update requests sent by slave nodes during that outage will be lost. To fully recover, you would likely need to reindex your Lucene indexes manually.

    A master-slave cluster using a directory provider based on filesystem or Infinispan, and a worker backend based on JMS or JGroups. Note that when using Infinispan, nodes do not need their own separate index copies.
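
As a sketch of the JMS approach, a slave node's worker backend configuration might look like the following. The JNDI names here are hypothetical and must match the queue and connection factory actually defined on your JMS provider:

import java.util.Properties;

public class JmsWorkerBackendConfig {

    // Builds the Hibernate Search worker properties for a slave node that
    // delegates its Lucene writes to the master over a JMS queue.
    public static Properties forSlaveNode() {
        Properties props = new Properties();

        props.setProperty("hibernate.search.default.worker.backend", "jms");
        props.setProperty("hibernate.search.default.worker.jms.connection_factory",
                "ConnectionFactory");
        props.setProperty("hibernate.search.default.worker.jms.queue",
                "queue/hibernateSearchQueue");

        return props;
    }

}

The master node keeps the default Lucene worker backend and registers a listener (typically a message-driven bean) that pulls the update requests off this queue and applies them. With JGroups, the backend property value changes (jgroupsSlave on the slaves and jgroupsMaster on the master in the Hibernate Search 4.x line), but the division of labor is the same.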


Summary

In this article, we explored the options for running applications in multi-node server clusters, to spread out and handle user requests in a distributed fashion. We also learned how to use sharding to help make our Lucene indexes faster and more manageable.
