A Different Kind of Database

Explosive growth

Relational databases worked well when systems were serving hundreds or even thousands of users, but the Internet has changed all of that. The number of users and the volume of data are growing exponentially. A variety of social applications have proved that an application can quickly attract millions of users. Relational databases were never built to handle this level of concurrent access.

Semi-structured data

In addition to the staggering growth, data is no longer simple rows and columns. Semi-structured data is everywhere. Extensible Markup Language (XML) and JavaScript Object Notation (JSON) are the lingua franca of our distributed applications. These formats allow complex relationships to be modeled through hierarchy and nesting. Relational databases struggle to represent these data patterns effectively. Because of this impedance mismatch, our applications are littered with additional complexity. Object-relational mapping (ORM) tools have helped but have not solved this problem.

With the growth of Software as a Service (SaaS) and cloud-based applications, the need for flexible schemas has increased. Tenants are hosted on a unified infrastructure, but each must retain the flexibility to customize its data model to meet its unique business needs. In these multi-tenant environments, the rigid schema structure imposed by a relational database does not work.

Architecture changes

While data is still king, how we architect our data-dependent systems has changed significantly over the past few decades. In many systems, the database acted as the integration point for different parts of the application. This required the data to be stored in a uniform way since the database was acting as a form of API. The following diagram shows the architectural transitions:

With the move to Service Oriented Architectures (SOA), how data is stored for a given component has become less important. The application interfaces with the service, not the database. The application has a dependency on the service contract, not on the database schema. This shift has opened up the possibility of storing data based on the needs of the service.

Rethinking the database

The factors we have been discussing have led many in our industry to rethink the idea of a database. Engineers wrestled with the limitations of the relational database and set out to build modern web-scale databases. The term NoSQL was coined to label this category of databases. Originally, the term stood for No SQL but has evolved to mean Not Only SQL. To confuse matters further, some NoSQL databases support a SQL-like query dialect. However, none of them are relational databases.

While the NoSQL landscape continues to expand with more projects and companies getting in on the action, there are four basic categories that these databases fall into:

Document (CouchDB, MongoDB, RavenDB)
Graph (Neo4j, Sones)
Key/Value (SimpleDB, Dynamo, Voldemort)
Tabular/Wide Column (BigTable, Apache HBase, Cassandra)

Document databases

Document databases are made up of semi-structured, schema-free data structures known as documents. In this case, the term document does not refer to a PDF or Word document. Rather, it refers to a rich data structure that can represent related data from the simple to the complex. In document databases, documents are usually represented in JavaScript Object Notation (JSON). A document can contain any number of fields of any length. Fields can also contain multiple pieces of data. Each document is independent and contains all of the data elements required by the entity.

The following is an example of a simple document:

{
  Name: "Alexander Graham Bell",
  BornIn: "Edinburgh, United Kingdom",
  Spouse: "Mabel Gardiner Hubbard"
}

And the following is an example of a more complex document:

{
  Name: "Galileo Galilei",
  BornIn: "Pisa, Italy",
  YearBorn: "1564",
  Children: [
    { Name: "Virginia", YearBorn: "1600" },
    { Name: "Vincenzo", YearBorn: "1606" }
  ]
}

Since documents are JSON-based, the impedance mismatch that exists between the object-oriented and relational database worlds is gone. An object graph is simply serialized into JSON for storage. As a result, the complexity of an entity has little impact on performance. Entire object graphs can be read and written in one database operation. There is no need to perform a series of select statements or create complex stored procedures to read the related objects.
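
To make the idea of serializing an object graph concrete, here is a minimal sketch in Python rather than .NET. The Person and Child classes and the to_document and from_document helpers are hypothetical names invented for this illustration; they are not part of any RavenDB API. The sketch only shows that a parent and its children can travel as a single JSON document.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Child:
    Name: str
    YearBorn: str


@dataclass
class Person:
    Name: str
    BornIn: str
    YearBorn: str
    Children: List[Child] = field(default_factory=list)


def to_document(person):
    """Serialize the whole object graph (person plus children) to one JSON string."""
    return json.dumps(asdict(person))


def from_document(text):
    """Rebuild the object graph from a single JSON document."""
    data = json.loads(text)
    children = [Child(**child) for child in data.pop("Children", [])]
    return Person(Children=children, **data)


galileo = Person(
    Name="Galileo Galilei",
    BornIn="Pisa, Italy",
    YearBorn="1564",
    Children=[Child("Virginia", "1600"), Child("Vincenzo", "1606")],
)

document = to_document(galileo)      # one write would store this string
restored = from_document(document)   # one read brings back the full graph
```

The nested children are stored inside the parent document, so no join or second query is needed to load them.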

JSON documents also add flexibility due to their schema-free design. This allows systems to evolve without forcing the existing data to be restructured. The schema-free nature simplifies data structure evolution and customization. However, care must be taken with the evolving data structure. If the evolution is a breaking change, documents must be migrated or additional intelligence needs to be built into the application.
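
As a sketch of what "additional intelligence in the application" might look like, the following Python function upgrades old documents as they are read. The breaking change (splitting Name into FirstName and LastName) and the SchemaVersion field are assumptions made up for this example; the article does not prescribe a particular approach.

```python
def upgrade_person(doc):
    """Return a copy of doc in the hypothetical version-2 shape.

    Version 1 documents carry a single Name field; version 2 documents
    carry FirstName and LastName. Both shapes are assumptions for this sketch.
    """
    upgraded = dict(doc)
    if upgraded.get("SchemaVersion", 1) < 2:
        # Split on the last space so middle names stay with the first name.
        first, _, last = upgraded.pop("Name", "").rpartition(" ")
        upgraded["FirstName"] = first
        upgraded["LastName"] = last
        upgraded["SchemaVersion"] = 2
    return upgraded


old_doc = {"Name": "Alexander Graham Bell", "BornIn": "Edinburgh, United Kingdom"}
new_doc = upgrade_person(old_doc)
```

Documents already at version 2 pass through unchanged, so old and new documents can coexist until a full migration is convenient.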

A document database for the .NET platform

Prior to RavenDB, document databases such as CouchDB treated .NET as an afterthought. In 2010, Oren Eini from Hibernating Rhinos decided to bring a powerful document database to the .NET ecosystem. According to his blog:

Raven is an OSS (with a commercial option) document database for the .NET/Windows platform. While there are other document databases around, such as CouchDB or MongoDB, there really isn’t anything that a .NET developer can pick up and use without a significant amount of friction. Those projects are excellent in what they do, but they aren’t targeting the .NET ecosystem.

RavenDB is built to be a first-class citizen on the .NET platform, offering developers the ability to easily extend and embed the database in their applications. A few of the key features that make RavenDB compelling to .NET developers are as follows:

RavenDB comes with a fully functional .NET client API, which implements unit of work, change tracking, read and write optimizations, and much more. It also has a REST-based API, so you can access it directly from JavaScript.
It allows developers to define indexes using LINQ (Language Integrated Query). It also supports map/reduce operations on top of your documents using LINQ.
It supports System.Transactions and can take part in distributed transactions.
The server can be easily extended by adding a custom .NET assembly.

RavenDB architecture

RavenDB builds on ESENT, an existing storage engine that is known to scale to very large data sizes. ESENT is the storage engine utilized by Microsoft Exchange and Active Directory. The storage engine provides the transactional data store for the documents. RavenDB also utilizes another proven technology called Lucene.NET for its high-speed indexing engine. Lucene.NET is an open source Apache project used to power applications such as Autodesk, Stack Overflow, Orchard, Umbraco, and many more.

The following diagram shows the primary components of the RavenDB architecture:

Storing documents

When a document is inserted or updated, RavenDB performs the following:

A document change comes in and is stored in ESENT. Documents are immediately available to load by ID, but won’t appear in searches until they are indexed.
An asynchronous indexing task takes work from the queue and updates the Lucene index. The index can be created manually or dynamically based on the queries executed by the application.
The document now appears in queries. Typically, index updates have an average latency of 20 milliseconds. RavenDB provides an API to wait for updates to be indexed if needed.
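
The steps above can be pictured with a toy in-memory model. This is a sketch in Python, not RavenDB code: the ToyDocumentStore class and its method names are hypothetical, a dictionary stands in for ESENT, another dictionary stands in for the Lucene index, and indexing is triggered by hand so the behavior is easy to follow.

```python
from collections import deque


class ToyDocumentStore:
    """Toy model of the write path: store first, index later."""

    def __init__(self):
        self.documents = {}     # stands in for the transactional document storage
        self.index = {}         # stands in for the index: Name -> set of document ids
        self.pending = deque()  # document ids waiting to be indexed

    def put(self, doc_id, doc):
        # Step 1: the change is stored and indexing work is queued.
        self.documents[doc_id] = doc
        self.pending.append(doc_id)

    def load(self, doc_id):
        # Loading by ID reads the document storage directly.
        return self.documents.get(doc_id)

    def run_indexing(self):
        # Step 2: the indexing task drains the queue and updates the index.
        while self.pending:
            doc_id = self.pending.popleft()
            for ids in self.index.values():
                ids.discard(doc_id)  # drop entries left over from an older version
            name = self.documents[doc_id]["Name"]
            self.index.setdefault(name, set()).add(doc_id)

    def is_stale(self):
        # The index is stale while work is still queued.
        return bool(self.pending)

    def query_by_name(self, name):
        # Step 3: queries see only what has already been indexed.
        return sorted(self.index.get(name, set()))


store = ToyDocumentStore()
store.put("people/1", {"Name": "Galileo Galilei"})
loaded = store.load("people/1")                    # available immediately
before = store.query_by_name("Galileo Galilei")    # not yet visible to queries
store.run_indexing()
after = store.query_by_name("Galileo Galilei")     # visible once indexed
```

In this model the document can be loaded by ID right after the write, while the query result stays empty until the indexing step has run.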

Searching and retrieving documents

When a request for a document comes in with a document ID, the server is able to pull the document directly from the RavenDB database. All searches and other inquiries hit the Lucene index. These methods provide near-instant access, regardless of the database size.

A key difference between RavenDB and a relational database is the way index consistency is handled. A relational database ties index updates to data modifications. The insert, update, or delete only completes once the indexes have been updated. This provides users with a consistent view of the data, but performance can quickly degrade when the system is under heavy load.

RavenDB, on the other hand, uses a model for indexes known as eventual consistency. Indexes are updated asynchronously from the document storage. This means that a change is not always visible in an index immediately after the document is written. By queuing the indexing operation on a background thread, the system is able to continue servicing reads while the indexing operation catches up. Eventual consistency is a bit counter-intuitive. We do not want the user to view stale data. However, in a multiuser system our users view stale data all the time. Once the data is displayed on the screen, it becomes stale and may have been modified by another user.

The following diagram illustrates stale data in a multiuser system:

In many cases, this staleness does not matter. Consider a blog post. When you publish a new article, does it really matter if the article becomes visible to the entire world that nanosecond? Will users on the Internet really know if it wasn't? What typically matters is providing feedback to the user who made the change: either let them know when the change becomes available, or pause briefly while the indexing catches up. If a user did not initiate the data change, then it is even easier. The change will simply become available when it enters the index. This provides a mechanism to give each user personal consistency. The user making the change can wait for their own changes to take effect while other users don't need to wait.
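
Personal consistency can be sketched with a background indexing thread and a wait that only the author of a change performs. As before, this is an illustrative Python model and not the RavenDB client API: BackgroundIndexedStore, wait_until_indexed, and the simulated indexing delay are all assumptions made for this example.

```python
import threading
import time


class BackgroundIndexedStore:
    """Toy store whose index is updated by a background thread."""

    def __init__(self, indexing_delay=0.05):
        self._cond = threading.Condition()
        self._documents = {}
        self._indexed_ids = set()
        self._pending = []
        self._delay = indexing_delay  # simulated cost of indexing one document
        threading.Thread(target=self._index_forever, daemon=True).start()

    def put(self, doc_id, doc):
        # The write returns as soon as the document is stored and queued.
        with self._cond:
            self._documents[doc_id] = doc
            self._pending.append(doc_id)
            self._cond.notify_all()

    def _index_forever(self):
        while True:
            with self._cond:
                while not self._pending:
                    self._cond.wait()
                doc_id = self._pending.pop(0)
            time.sleep(self._delay)  # indexing happens outside the lock
            with self._cond:
                self._indexed_ids.add(doc_id)
                self._cond.notify_all()

    def query_ids(self):
        # Queries never block; they return whatever is indexed right now.
        with self._cond:
            return sorted(self._indexed_ids)

    def wait_until_indexed(self, doc_id, timeout=2.0):
        # Only the user who made the change needs to call this.
        with self._cond:
            return self._cond.wait_for(
                lambda: doc_id in self._indexed_ids, timeout=timeout
            )


store = BackgroundIndexedStore()
store.put("posts/1", {"Title": "A Different Kind of Database"})
other_user_view = store.query_ids()            # may not include posts/1 yet
indexed = store.wait_until_indexed("posts/1")  # the author pauses briefly
author_view = store.query_ids()                # now includes posts/1
```

Other users keep querying without waiting and see the new post whenever it reaches the index, while the author pays a short delay to see their own change.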

Eventual consistency is a tradeoff between application responsiveness for all users and consistency between indexes and documents. When used appropriately, this tradeoff becomes a tool for increasing the scalability of a system.

Summary

As you can see, RavenDB is truly a different kind of database. It makes fundamental changes to what we expect from a database. It requires us to approach problems from a fresh perspective. It requires us to think differently.