NoSQL, or non-relational, databases are increasingly used in big data and real-time web applications. They provide a mechanism for storing and retrieving data that is not modeled as the tables of a relational database.
There are many advantages to using a NoSQL database:
- Horizontal Scalability
- Automatic replication (using multiple nodes)
- Loosely defined or no schema (Huge advantage, if you ask me!)
- Sharding and distribution
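To make the sharding item a little more concrete, here is a minimal sketch of hash-based sharding in Python. The node names and the `shard_for` helper are hypothetical, and real systems typically use more elaborate schemes (consistent hashing or range partitioning, for instance), so treat this only as an illustration of the idea.

```python
import hashlib

# Hypothetical node names; a real cluster would manage membership itself.
NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str, nodes=NODES) -> str:
    """Map a key to a node using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Each key always routes to the same node, so data spreads across machines.
placement = {key: shard_for(key) for key in ["gene:1", "gene:2", "gene:3", "gene:4"]}
for key, node in placement.items():
    print(key, "->", node)
```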
Recently we were discussing the possibility of moving our data storage from HDF5 files to a NoSQL system. HDF5 files are great for storage and retrieval, but with huge amounts of data coming in we need to scale up, and the hierarchical schema of HDF5 is not well suited to all the kinds of data we use. I am a bioinformatician working on data science applications for genomic data. We have genomic annotation files (GFF format), genotype sequences (FASTA format), phenotype data (tables), and many other data formats. We want to store this data in a space- and memory-efficient way, and the framework should also support fast retrieval.
I did some research on the NoSQL options and prepared this cheat-sheet. It should be useful for anyone thinking about moving their storage to a non-relational database. Data scientists also need to be comfortable with the basic ideas of NoSQL databases. In the Coursera course Introduction to Data Science by Prof. Bill Howe (University of Washington), NoSQL databases formed a significant part of the lectures. I highly recommend the lectures on these topics, and the course in general. This cheat-sheet should also help aspiring data scientists in their interviews.
Some options for NoSQL databases:
Membase:
This is a key-value database. It is very efficient if you only need to quickly retrieve a value by its key. It has all of the advantages of memcached when it comes to the low cost of implementation. There is not much emphasis on scalability, but lookups are very fast. Values are stored in JSON format with no predefined schema. The weakness of using it for important data is that it’s a pure key-value store, and thus is not queryable on properties.
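A rough sketch of what "not queryable on properties" means in practice, using a plain Python dictionary as a stand-in for a key-value store. This is not the Membase client API, and the keys and records are made up for illustration.

```python
import json

# A dict stands in for the key-value store; values are opaque JSON strings.
store = {}
store["sample:001"] = json.dumps({"species": "maize", "height_cm": 210})
store["sample:002"] = json.dumps({"species": "rice", "height_cm": 95})

# Lookup by key is a single hash lookup.
record = json.loads(store["sample:001"])

# A query on a property has no index to use: every value must be
# decoded and checked one by one.
tall = [k for k, v in store.items() if json.loads(v)["height_cm"] > 100]
print(record, tall)
```

The second query works here only because the stand-in is tiny; on a large store, scanning every value defeats the purpose of fast key lookups.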
MongoDB:
If you need to associate a more complex structure, such as a document, with a key, then MongoDB is a good option. A single query retrieves the whole document, which can be a huge win. However, using these documents as simple key-value stores would not be as fast or as space-efficient as Membase.
- Documents are the basic unit.
- Documents are JSON-like (stored as BSON) with no predefined schema.
- It makes integration of data easier and faster.
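The document model can be sketched with plain Python dictionaries. This is not PyMongo: the `find` helper and the field names are hypothetical, and the sketch only shows how documents in one collection can have different shapes and come back whole from a single query.

```python
import json

# Two documents in the same collection with different fields (no predefined schema).
collection = [
    {"_id": "gene1", "chrom": "1", "start": 100, "end": 900, "tags": ["kinase"]},
    {"_id": "gene2", "chrom": "2", "start": 50, "end": 400, "notes": {"curated": True}},
]

def find(docs, **criteria):
    """Hypothetical helper: return documents whose fields equal the given values."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

# One query returns the whole nested document, not just a single value.
hits = find(collection, chrom="2")
print(json.dumps(hits[0], indent=2))
```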
Berkeley DB:
It stores records as key-value pairs. Both key and value can be arbitrary byte strings of variable length. You can put native programming language data structures into the database without converting to a foreign record format first. Storage and retrieval are very simple, but the application needs to know the structure of a key and a value in advance; it cannot ask the database.
- Simple data access services.
- No limit to the data types that can be stored.
- No special support for binary large objects (unlike some others)
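The point that the application must know the record structure can be illustrated with Python's `struct` module. A dictionary stands in for the database here (this is not a Berkeley DB binding), and the record layout is an assumption chosen for the example.

```python
import struct

# A dict stands in for the store; both keys and values are raw bytes.
db = {}

# The application decides the value layout: a 4-byte unsigned int followed by
# an 8-byte float, little-endian. The store itself knows nothing about it.
RECORD = struct.Struct("<Id")

db[b"marker:42"] = RECORD.pack(1200, 0.37)

# Reading back only works because the application knows the same layout.
position, frequency = RECORD.unpack(db[b"marker:42"])
print(position, frequency)
```

If a different program read the same bytes with a different layout, it would get garbage without any error from the store, which is the trade-off of opaque byte-string values.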
Berkeley DB vs. MongoDB:
- Berkeley DB has no partitioning while MongoDB supports sharding.
- MongoDB has some predefined data types like float, string, integer, double, boolean, date, and so on.
- Berkeley DB is a key-value store and MongoDB stores documents.
- Both are schema free.
- Berkeley DB has no built-in support for Python, for example, although there are many third-party libraries.
Redis:
If you need richer structures such as lists, sets, sorted sets and hashes, then Redis is the best bet. It’s very fast and provides useful data structures. It just works, but don’t expect it to handle every use case. Nevertheless, it is certainly possible to use Redis as your primary data store. It is used less for distributed scalability; instead, it optimizes for high-performance lookups at the cost of not supporting relational queries.
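As a loose analogy, the structures named above map onto Python built-ins roughly as follows. This is not the Redis client API; the variable names are made up, and the sorted set is approximated by sorting a member-to-score mapping on read.

```python
# Python built-ins approximating the Redis structures named above.
recent_jobs = []                      # list: ordered, allows duplicates
recent_jobs.insert(0, "align")        # push to the head of the list
recent_jobs.insert(0, "call-variants")

seen_samples = {"s1", "s2"}           # set: unique members
seen_samples.add("s1")                # adding an existing member has no effect

scores = {"s1": 0.91, "s2": 0.75, "s3": 0.88}   # sorted set: member -> score
ranked = sorted(scores, key=scores.get, reverse=True)

sample_info = {"species": "maize", "ploidy": "2"}   # hash: field -> value

print(recent_jobs, sorted(seen_samples), ranked, sample_info)
```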
Cassandra:
Each key has values as columns and columns are grouped together into sets called column families. Thus each key identifies a row of a variable number of elements.
- A column family contains rows and columns. Each row is uniquely identified by a key and has multiple columns. Think of a column family as a table, each key-value pair being a row. Unlike in an RDBMS, different rows in a column family don’t have to share the same set of columns, and a column may be added to one or multiple rows at any time.
- A hybrid between a key-value and a column-oriented database.
- Has a partially defined schema.
- Can handle large amounts of data across many servers (clusters), is fault-tolerant and robust.
- Example: Cassandra was originally written by Facebook for Inbox Search, where it was later replaced by HBase.
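The column-family idea described above can be sketched as a nested dictionary: row key to a mapping of column names to values. This is only a mental model in plain Python, not the Cassandra driver or CQL, and the row keys and column names are invented.

```python
# A column family modeled as row key -> {column name: value}.
# Rows and column names are made up for illustration.
phenotypes = {
    "plant:001": {"height_cm": 210, "flowering_days": 68},
    "plant:002": {"height_cm": 185},
}

# A column can be added to a single row at any time; other rows are unaffected.
phenotypes["plant:002"]["kernel_color"] = "yellow"

for row_key, columns in phenotypes.items():
    print(row_key, sorted(columns))
```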
HBase:
It is modeled after Google’s Bigtable. The ideal use for HBase is in situations where you have big data and need flexibility, great performance, and scaling. The data structure is similar to Cassandra’s, with column families.
- Built on Hadoop (HDFS), and can do MapReduce without any external support.
- Very efficient for storing sparse data.
- Big data (2 billion rows) is easy to deal with.
- Example: a scalable email/messaging system with search.
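Why a column-family layout suits sparse data can be seen in a small sketch: only cells that actually hold a value are stored. The dictionary below is a stand-in, not the HBase API, and the row keys and `family:qualifier` column names are hypothetical.

```python
# Sparse storage sketch: only cells that hold a value are stored, keyed by
# (row key, "family:qualifier"). All names here are hypothetical.
cells = {
    ("sample:001", "geno:snp_17"): "A/T",
    ("sample:002", "geno:snp_902"): "G/G",
    ("sample:002", "pheno:height_cm"): "185",
}

rows = {row for row, _ in cells}
columns = {col for _, col in cells}

# A dense table would reserve rows x columns slots; the sparse map stores
# only the cells that were written.
dense_slots = len(rows) * len(columns)
print(len(cells), "stored cells vs", dense_slots, "dense slots")

def get(row, column):
    """Return the cell value, or None when the cell was never written."""
    return cells.get((row, column))
```

With thousands of possible columns and only a handful filled per row, the gap between stored cells and dense slots grows very quickly.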
HBase vs. Cassandra:
HBase is more suitable for data warehousing and large-scale data processing and analysis (such as indexing the web for a search engine), while Cassandra is more apt for real-time transaction processing and serving interactive data.
- Cassandra is more write-centric and HBase is more read-centric.
- Cassandra has multi-data-center support, which can be very useful.
About the Author
Janu Verma is a Quantitative Researcher at the Buckler Lab, Cornell University, where he works on problems in bioinformatics and genomics. His background is in mathematics and machine learning and he leverages tools from these areas to answer questions in biology.