How to query sharded data in MongoDB

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering MongoDB 3.x written by Alex Giamas. This book covers the essential as well as advanced administration concepts in MongoDB.[/box]

Querying data using a MongoDB shard is different than a single server deployment or a replica set. Instead of connecting to the single server or the primary of the replica set, we connect to the mongos router which decides which shard to ask for our data. In this article, we will explore how the MongoDB query router operates and showcase how the process is similar to working with a replica set, using Ruby.

The query router

The query router, also known as mongos process, acts as the interface and entry point to our MongoDB cluster. Applications connect to it instead of connecting to the underlying shards and replica sets; mongos executes queries, gathers results, and passes them to our application.
mongos doesn't hold any persistent state and is typically low on system resources, and is typically hosted in the same instance as the application server.

It is acting as a proxy for requests. When a query comes in, mongos will examine and decide which shards need to execute the query and establish a cursor in each one of them.

The Find operation

If our query includes the shard key or a prefix of the shard key, mongos will perform a targeted operation, only querying the shards that hold the keys that we are looking for.

For example, with a composite shard key of {_id, email, address} on our collection

User, we can have a targeted operation with any of the following queries:

> db.User.find({_id: 1})

> db.User.find({_id: 1, email: 'alex@packt.com'})

> db.User.find({_id: 1, email: 'janluc@packt.com', address: 'Linwood

Dunn'})

All three of them are either a prefix (the first two) or the complete shard key.

On the other hand, a query on {email, address} or {address} will not be able to target the right shards, resulting in a broadcast operation.

A broadcast operation is any operation that doesn't include the shard key or a prefix of the shard key and results in mongos querying every shard and gathering results from them. It's also known as a scatter-and-gather operation or a fanout query.

Sort/limit/skip operations

If we want to sort our results, there are two options:

If we are using the shard key in our sort criteria, then mongos can determine the order in which it has to query the shard or shards. This results in an efficient and, again, targeted operation.
If we are not using the shard key in our sort criteria, then as with a query without sort, it's going to be a fanout query. To sort the results when we are not using the shard key, the primary shard executes a distributed merge sort locally before passing on the sorted result set to mongos.

Limit on queries is enforced on each individual shard and then again at the mongos level as there may be results from multiple shards.

Skip, on the other hand, cannot be passed on to individual shards and will be applied by mongos after retrieving all the results locally.

Update/remove operations

In document modifier operations like update and remove, we have a similar situation to find. If we have the shard key in the find section of the modifier, then mongos can direct the query to the relevant shard.

If we don't have the shard key in the find section, then it will again be a fanout operation.

In essence, we have the following cases for operations with sharding:

Type of operation	Query topology
insert	Must have the shard key
update	Can have the shard key
Query with shard key	Targeted operation
Query without shard key	Scatter gather/fanout query
Indexed/sorted query with shard key	Targeted operation
Indexed/sorted query without shard key	Distributed sort merge

Querying using Ruby

Connecting to a sharded cluster using Ruby is no different than connecting to a replica set. Using the Ruby official driver we have to configure the client object to define the set of mongos servers:

client = Mongo::Client.new('mongodb://key:password@mongos-server1-

Host:mongos-server1-port,mongos-server2-host:mongos-server2-

port/admin?ssl=true&authSource=admin')

The mongo-ruby-driver will then return a client object that is no different than connecting to a replica set from the Mongo Ruby client.

We can then use the client object like we did in previous chapters, with all the caveats around how sharding behaves differently than a standalone server or a replica set with regards to querying and performance.

Performance comparison with replica sets

Developers and architects are always looking out for ways to compare performance between replica sets and sharded configurations.

The way MongoDB implements sharding, it is based on top of replica sets. Every shard in production should be a replica set.

The main difference in performance comes from fan out queries. When we are querying without the shard key, MongoDB's execution time is limited by the worst-performing replica set.

In addition, when using sorting without the shard key, the primary server has to implement the distributed merge sort on the entire dataset. This means that it has to collect all data from different shards, merge-sort them, and pass them as sorted to mongos.

In both cases, network latency and limitations in bandwidth can slow down operations as opposed to a replica set. On the flip side, by having three shards, we can distribute our working set requirements across different nodes, thus serving results from RAM instead of reaching out to the underlying storage, HDD or SSD.

On the other hand, writes can be sped up significantly since we are no longer bound by a single node's I/O capacity but we can have writes in as many nodes as there are shards.

To sum up, in most cases and especially for the cases that we are using the shard key, both queries and modification operations will be significantly sped up by sharding.

If you found this post useful, check out our book Mastering MongoDB 3.x for more tips and techniques on sharding, replication and other database administration tasks related to MongoDB.