





















































In this article by Hrishikesh Vijay Karambelkar, author of the book Scaling Big Data with Hadoop and Solr - Second Edition, we will go through Apache Solr and MongoDB together. In an enterprise, data is generated by all the software that participates in day-to-day operations. This data comes in different formats, and bringing it in for big data processing requires a storage system flexible enough to accommodate data with varying data models. A NoSQL database, by its design, is best suited for this kind of storage requirement. One of the primary objectives of NoSQL is horizontal scaling, which requires partition tolerance (the P in the CAP theorem), and this comes at the cost of sacrificing either consistency or availability. Visit http://en.wikipedia.org/wiki/CAP_theorem to learn more about the CAP theorem.
As we have seen, data models for NoSQL differ completely from those of a relational database. With a flexible data model, it becomes very easy for developers to quickly integrate with a NoSQL database and bring in large volumes of data from different data sources. This makes NoSQL databases ideal for big data storage, which demands that different data types be brought together under one umbrella. NoSQL also offers different data models, such as the key-value store, document store, and BigTable-style column-family store.
In addition to a flexible schema, NoSQL offers scalability and high performance, again among the most important factors to consider when working with big data. NoSQL was designed from the start as a distributed type of database. While traditional relational stores rely on the high computing power of CPUs and large memory on a centralized system, NoSQL can run on low-cost, commodity hardware. These servers can be added to or removed from the cluster dynamically, making a NoSQL database easier to scale. NoSQL databases also support most advanced database features, such as data partitioning, index sharding, distributed queries, caching, and so on.
Although NoSQL offers optimized storage for big data, it may not be able to replace the relational database. Relational databases offer ACID transactions, rich CRUD operations, data integrity, and a structured database design approach, all of which are required in many applications and which NoSQL may not support. Hence, NoSQL is best suited for big data use cases where the data is unlikely to require transactional guarantees.
MongoDB is one of the popular NoSQL databases, just like Cassandra. MongoDB stores documents with arbitrary schemas in its own document-oriented storage, and it uses JSON-style documents (stored internally as BSON) for all communication with the server. The database is designed to handle heavy data volumes, and today many organizations are focusing on utilizing MongoDB for various enterprise applications.
MongoDB provides high availability and load balancing. Each data unit is replicated, and the combination of a data unit with its copies is called a replica set. Replicas in MongoDB can be either primary or secondary. The primary is the active replica, which is used for direct read-write operations, while a secondary replica works like a backup for the primary. MongoDB supports searches by field, range queries, and regular expression searches. Queries can return specific fields of documents and can also include user-defined JavaScript functions. Any field in a MongoDB document can be indexed. More information about MongoDB can be found at https://www.mongodb.org/.
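For instance, the following mongo shell sketch illustrates these query types; the books collection and its fields are purely illustrative and not part of this article's sample data:

> db.books.ensureIndex({price: 1})                // index any field, here price
> db.books.find({author: "Karambelkar"})          // search by field
> db.books.find({price: {$gt: 20, $lt: 50}})      // range query
> db.books.find({title: /hadoop/i})               // regular expression search
> db.books.find({}, {title: 1, price: 1})         // return only specific fields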
Data in MongoDB is eventually consistent when read from secondary replicas. Apache Solr can be used to work with MongoDB to enable search capabilities on a MongoDB-based data store. Unlike Cassandra, where the Solr indexes are stored directly in Cassandra through Solandra, MongoDB's integration with Solr keeps the indexes in Solr's own optimized storage.
There are various ways in which the data residing in MongoDB can be analyzed and searched. MongoDB's replication works by recording all operations made on a database in a log, called the oplog (operations log). The oplog keeps a rolling record of all operations that modify the data stored in your databases. Many implementers suggest tailing this log and pushing the data directly to Apache Solr using cURL or SolrJ. Since the oplog is a capped collection with an upper limit on its storage, it is feasible to keep such synchronization with Apache Solr lightweight. MongoDB also provides tailable cursors over the oplog; these cursors return documents in their natural order, thereby preserving the order of operations, as sketched below.
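The following mongo shell sketch illustrates this tailing approach. It assumes mongod is running as a replica set member, so that the local database contains the oplog.rs collection; in a real synchronizer, the printjson call would be replaced by a push to Solr via cURL or SolrJ:

> use local
switched to db local
> var cursor = db.oplog.rs.find().addOption(DBQuery.Option.tailable).addOption(DBQuery.Option.awaitData)
> while (cursor.hasNext()) { printjson(cursor.next()) }   // push each entry to Solr here instead

However, we are going to look at a different approach. Let's look at the following schematic diagram: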
In this case, MongoDB is exposed as a database to Apache Solr through a custom database driver. Apache Solr reads MongoDB data through the DataImportHandler, which in turn calls the JDBC-based MongoDB driver to connect to MongoDB and run the data import utilities. Since MongoDB supports replica sets, it manages the distribution of data across nodes. It also supports sharding, just like Apache Solr.
To install MongoDB in your development environment, download the distribution for your platform from https://www.mongodb.org/, extract the archive, create a data directory, and then start the MongoDB server by running the following command from the extracted directory ($MONGODB_HOME):
$ bin/mongod --dbpath <path to your data directory> --rest
In this case, the --rest parameter enables support for simple REST APIs that can be used to get the server status.
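As a quick check, you can query the HTTP interface from another shell. This is a minimal sketch, assuming mongod runs on the default port 27017, so that its HTTP console listens on port 28017:

$ curl http://localhost:28017/

The returned page shows an overview of the server status.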
Now that you have successfully installed MongoDB, try loading a sample dataset that ships with this book's code samples. Open a new command-line interface, change the directory to $MONGODB_HOME, and run the following command:
$ bin/mongoimport --db solr-test --collection zips --file "<file-dir>/samples/zips.json"
Please note that the database name is solr-test. You can view the stored data using the MongoDB shell by running the following set of commands:
$ bin/mongo
MongoDB shell version: 2.4.9
connecting to: test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
        http://docs.mongodb.org/
Questions? Try the support group
        http://groups.google.com/group/mongodb-user
> show dbs
local   0.078125GB
solr-test       0.203125GB
test    0.203125GB
> use solr-test
switched to db solr-test
> db.zips.find({city:"ACMAR"})
{ "city" : "ACMAR", "loc" : [ -86.51557, 33.584132 ], "pop" : 6055, "state" : "AL", "_id" : "35004" }
Congratulations! MongoDB is installed successfully.
To expose MongoDB to Solr as a database, you will need a JDBC driver built for MongoDB. However, the Mongo-JDBC driver has certain limitations, and it does not work with the Apache Solr DataImportHandler. So, I have extended Mongo-JDBC to work under the Solr-based DataImportHandler. The project repository is available at https://github.com/hrishik/solr-mongodb-dih. Let's look at the setup procedure for enabling MongoDB-based Solr integration:
First, copy the solr-mongodb-dih connector JAR, along with its dependent MongoDB driver JARs, into a directory on Solr's classpath, for example, one referenced by a <lib> directive in solrconfig.xml. These JARs are available with the book Scaling Big Data with Hadoop and Solr, Second Edition for download.
Next, create a data-source-config.xml file that defines MongoDB as a JDBC data source and maps its fields to Solr fields:

<dataConfig>
  <dataSource name="mongod" type="JdbcDataSource"
              driver="com.mongodb.jdbc.MongoDriver"
              url="mongodb://localhost/solr-test"/>
  <document>
    <entity name="nameage" dataSource="mongod"
            query="select name, price from grocery">
      <field column="name" name="name"/>
      <field column="name" name="id"/>
      <!-- other fields -->
    </entity>
  </document>
</dataConfig>
Then, register the DataImportHandler in your core's solrconfig.xml, pointing it at this configuration file:

<!-- DIH Starts -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config"><path to config>/data-source-config.xml</str>
  </lst>
</requestHandler>
<!-- DIH ends -->
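Once the handler is registered and Solr is restarted, you can trigger and monitor the import from the command line. This is a minimal sketch, assuming a core named collection1 running on Solr's default port:

$ curl "http://localhost:8983/solr/collection1/dataimport?command=full-import"
$ curl "http://localhost:8983/solr/collection1/dataimport?command=status"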
You can validate the content created by your new MongoDB DIH by accessing the Solr Admin page and running a query.
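Alternatively, the same check can be run from the command line, again assuming a core named collection1:

$ curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true"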
Using this connector, you can perform full-import operations on various data elements. Since MongoDB is not a relational database, it does not support join queries; however, it does support selects, order by, and so on.
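For example, an entity query along the following lines would be valid against the hypothetical grocery collection used in the earlier configuration:

<entity name="nameage" dataSource="mongod"
        query="select name, price from grocery order by price">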
In this article, we went through Apache Solr and MongoDB together and understood the distributed aspects of an enterprise search built on top of a NoSQL store.