
In this article by Mohsin Hijazee, the author of the book Mastering Google App Engine, we will see that learning something new is hard, but unlearning something is even harder. The main reason why learning something is hard is not that it is hard in and of itself, but that most of the time, you have to unlearn a lot in order to learn a little. This is quite true for the datastore. Basically, it is built to scale at the so-called Google scale. That's why, in order to be proficient with it, you will have to unlearn some of the things that you know. Your training as a computer science student or a programmer has been so deeply steeped in the relational model that it feels natural to you. Anything else may seem quite hard to grasp, and this is the reason why learning the Google datastore is quite hard.

However, if this were the only glitch, things would have been way simpler, because you could ask yourself to forget the relational world and consider the new paradigm afresh. Things have been complicated by Google's own official documentation, which presents the datastore in a manner that makes it seem close to something such as Django's ORM, Rails' ActiveRecord, or SQLAlchemy. Then, all of a sudden, it starts to list its limitations with a very brief mention of, or at times no mention of, why those limitations exist.

Since you only know the limitations but not why they are there in the first place, this lack of reasoning may leave you unable to work around those limitations or mold your problem space into the new solution space, which is the Google datastore.

We will try to fix this. Hence, the following will be our goals in this article:

  • To understand BigTable and its data model
  • To have a look at the physical data storage in BigTable and the operations that are available in it
  • To understand how BigTable scales
  • To understand datastore and the way it models data on top of BigTable

So, there’s a lot more to learn. Let’s get started on our journey of exploring datastore.

The BigTable

If you decided to fetch every web page hosted on the planet, download and store a copy of it, and later process every page to extract data from it, you'd find that your own laptop or desktop is not good enough to accomplish this task. It has nowhere near enough storage to hold all those pages. Laptops usually come with a 1 TB hard disk drive, which seems quite enough for a person who is not much into video content such as movies.

Assuming that there are 2 billion websites, each with an average of 50 pages and each page weighing around 250 KB, it sums up to around 23,000+ TB (or roughly 22 petabytes), which would need 23,000 such laptops to store all the web pages with a 1 TB hard drive in each.

Assuming the same statistics, if you are able to download at a whopping speed of 100 MBps, it would take you about seven years to download the whole content to one such gigantic hard drive if you had one in your laptop.

Let's suppose that you downloaded the content in whatever time it took and stored it. Now, you need to analyze and process it too. If processing takes about 50 milliseconds per page, a single machine would need more than 150 years to work through all the pages that you downloaded. The world would have changed a lot by then, leaving your data and processed results obsolete.
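
To make these figures concrete, here is the back-of-the-envelope arithmetic behind them, written as plain Python with decimal units and the same assumptions as above:

pages = 2000000000 * 50                    # 2 billion sites x 50 pages each
storage_tb = pages * 250 / 1e9             # 250 KB per page, expressed in TB
download_years = pages * 250 / 100000.0 / (3600 * 24 * 365)   # at 100 MB/s (100,000 KB/s)
processing_years = pages * 0.050 / (3600 * 24 * 365)          # 50 ms per page on one machine
print(storage_tb, download_years, processing_years)
# roughly 25000.0 TB, about 8 years to download, about 158 years to process serially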

This is the kind of scale for which BigTable is built. Every Google product that you see—Search, Analytics, Finance, Gmail, Docs, Drive, and Google Maps—is built on top of BigTable. If you want to read more about BigTable, you can go through the academic paper from Google Research, which is available at http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf.

The data model

Let's examine the data model of BigTable at a logical level. BigTable is basically a key-value store. So, everything that you store falls under a unique key, just like PHP's arrays, Ruby's hashes, or Python's dicts:

# PHP
$person['name'] = 'Mohsin';
# Ruby or Python
person['name'] = 'Mohsin'

However, this is a partial picture. We will learn the details gradually in a while. So, let’s understand this step by step.

A BigTable installation can have multiple tables, just like a MySQL database can have multiple tables. The difference here is that a MySQL installation might have multiple databases, which in turn might have multiple tables. However, in the case of BigTable, the first major storage unit is a table.

Each table can have hundreds of columns, which can be divided into groups called column families. You define column families at the time of creating a table. They cannot be altered later, but each column family might have hundreds of columns that you can define even after the creation of the table. The notation that is used to address a column within its column family is like job:title, where job is the column family and title is the column. So here, you have a job column family that stores all the information about the job of the user, and title is supposed to store the job title. However, one important fact about these columns is that there's no concept of data types in BigTable as you'd find in relational database systems. Everything is just an uninterpreted sequence of bytes, which means nothing to BigTable. What the bytes really mean is up to you. They might be a very long integer, a string, or JSON-encoded data.

Now, let's turn our attention to the rows. There are two major characteristics of the rows that we are concerned about. First, each row has a key, which must be unique. The contents of the key are again an uninterpreted string of bytes that is up to 64 KB in length. A key can be anything that you want it to be. All that's required is that it must be unique within the table; if it is not, you will simply overwrite the contents of the existing row that has the same key.

Which key should you use for a row in your table? That’s the question that requires some consideration. To answer this, you need to understand how the data is actually stored. Till then, you can assume that each key has to be a unique string of bytes within the scope of a table and should be up to 64 KB in length.

Now that we know about tables, column families, columns, rows, and row keys, let's look at an example of a BigTable table that stores employees' information. Let's pretend that we are creating something similar to LinkedIn here. So, here's the table:

Key (name) | personal:lastname | personal:age | professional:company | professional:designation
Mohsin     | Hijazee           | 29           | Sony                 | Senior Designer
Peter      | Smith             | 34           | Panasonic            | General Manager
Kim        | Yong              | 32           | Sony                 | Director
Ricky      | Martin            | 45           | Panasonic            | CTO
Paul       | Jefferson         | 39           | LG                   | Sales Head

So, this is a sample BigTable table. The first column is the name, and we have chosen it as the key. It is of course not a good key, because a first name cannot be assumed to be unique, even in small groups, let alone in millions of records. However, for the sake of this example, we will assume that the name is unique. Another reason for assuming the name's uniqueness is that we want to build our understanding gradually. So, the key point here is that we picked the first name as the row's key for now, but we will improve on this as we learn more.

Next, we have two column groups. The personal column group holds all the personal attributes of the employees, and the other column family named professional has all the other attributes pertaining to the professional aspects. When referring to a column within a family, the notation is family:column. So, personal:age contains the age of the employees.

If you look at professional:designation and personal:age, it seems that the first one's contents are strings, while the second one stores integers. That's false. No column stores anything but plain bytes, without any distinction of what they mean. The meaning and interpretation of these bytes is up to the user of the data. From the point of view of BigTable, each column just contains plain old bytes. Another thing that is drastically different from RDBMSes such as MySQL is that each row need not have the same number of columns. Each row can adopt whatever layout it wants. So, the second row's personal column family can have two more columns that store gender and nationality.
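
To picture this, here's a plain-Python sketch of two such rows (this is only an illustration of the layout, not BigTable's API; the gender and nationality values are made up):

rows = {
    b'Mohsin': {b'personal:lastname': b'Hijazee',
                b'personal:age': b'29',
                b'professional:company': b'Sony'},
    b'Peter':  {b'personal:lastname': b'Smith',
                b'personal:age': b'34',
                b'personal:gender': b'male',           # extra columns that only this row has
                b'personal:nationality': b'British',
                b'professional:company': b'Panasonic'},
}
# Every value is just bytes; whether b'29' means the integer 29 is entirely up to
# the application that reads it back.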

For this particular example, the data is in no particular order, and I wrote it down as it came to my mind. Hence, there’s no order of any sort in the data at all.

To summarize, BigTable is a key-value storage where keys should be unique and have a length that is less than or equal to 64 KB. The columns are divided into column families, which can be created at the time of defining the table, but each column family might have hundreds of columns created as and when needed. Also, contents have no data type and comprise just plain old bytes.

There's one minor detail left, which is not important for our purpose. However, for the sake of the completeness of BigTable's data model, I will mention it now. Each column value is stored with a timestamp that is accurate to microseconds, and in this way, multiple versions of a column value are available. The number of recent versions to keep is configurable at the table level, but since we are not going to deal with BigTable directly, this detail is not important to us.

How is data stored?

Now that we know about row keys, column families, and columns, we will gradually move towards examining this data model in detail and understand how the data is actually stored. We will examine the logical storage and then dive into the actual structure, as it ends up on the disk.

The data that we presented in the earlier table had no order and was listed as it came to my mind. However, while storing, the data is always sorted by the row key. So now, the data will actually be stored like this:

Key (name) | personal:lastname | personal:age | professional:company | professional:designation
Kim        | Yong              | 32           | Sony                 | Director
Mohsin     | Hijazee           | 29           | Sony                 | Senior Designer
Paul       | Jefferson         | 39           | LG                   | Sales Head
Peter      | Smith             | 34           | Panasonic            | General Manager
Ricky      | Martin            | 45           | Panasonic            | CTO

OK, so what happened here? The name column is the key of the table, and the whole table is now sorted by the key. That's exactly how it is stored on the disk as well. An important thing to note is that the sort is lexicographic, not semantic. By lexicographic, we mean that rows are sorted by byte value and not by any textual or linguistic ordering. This matters because even within the Latin character set, different languages have different sort orders for letters—English versus German and French, for example. However, none of that, nor the Unicode collation order, applies here. The data is simply sorted by byte values. In our instance, since K has a smaller byte value (a lower ASCII/Unicode value) than the letter M, it comes first. Now, suppose that some European language sorts M before K. That's not how the data would be laid out here, because it is a plain, blind, and simple sort. The data is sorted by byte value, with no regard for semantic value. In fact, to BigTable, this is not even text. It's just a plain string of bytes.
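
A quick plain-Python illustration of byte-value ordering (independent of BigTable): uppercase letters have smaller byte values than lowercase ones, and no locale-aware collation is applied:

keys = [b'Mohsin', b'Kim', b'apple', b'Zebra']
print(sorted(keys))
# [b'Kim', b'Mohsin', b'Zebra', b'apple']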

Just a hint. This order of keys is something that we will exploit when modeling data. How? We’ll see later.

The Physical storage

Now that we understand the logical data model and how it is organized, it’s time to take a closer look at how this data is actually stored on the disk. On a physical disk, the stored data is sorted by the key. So, key 1 is followed by its respective value, key 2 is followed by its respective value, and so on. At the end of the file, there’s a sorted list of just the keys and their offset in the file from the start, which is something like the block to the right:

Ignore the block on your left that is labeled Index. We will come back to it in a while. This particular format actually has a name: SSTable (Sorted String Table), because it contains strings (the keys) and they are sorted. It is of course tabular data, and hence the name.

Whenever your data is sorted, you have certain advantages, the first and foremost being that looking up an item or a range of items becomes much cheaper. We will discuss this in detail later in this article. Now, if we start from the beginning of the file and read sequentially, noting down every key and its offset in a format such as key:offset, we effectively create an index of the whole file in a single scan. That's where the first block to your left in the preceding diagram comes from. Since the keys in the file are sorted, we simply read it sequentially till the end, effectively creating an index of the data. Furthermore, since this index contains only the keys and their offsets in the file, it is much smaller in terms of the space it occupies.

Now, assuming that an SSTable is, say, 500 MB in size, we only need to load the index from the end of the file into memory, and whenever we are asked for a key or a range of keys, we first search the in-memory index (thus not touching the disk at all). Only if we find the key do we seek into the disk at the given offset, because we know the offset of that particular key from the index that we loaded into memory.
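
Here is a minimal in-memory sketch of that idea in plain Python (a toy model, not the actual SSTable file format): values are written sequentially in key order, a key-to-offset index is kept in memory, and a read consults only the index before seeking:

import io

class ToySSTable:
    def __init__(self, rows):                  # rows: a dict of key -> value, both bytes
        self._data = io.BytesIO()              # stands in for the on-disk file
        self.index = {}                        # key -> (offset, length), kept in memory
        for key in sorted(rows):               # keys are written in sorted order
            value = rows[key]
            self.index[key] = (self._data.tell(), len(value))
            self._data.write(value)

    def get(self, key):
        if key not in self.index:              # answered from the in-memory index alone
            return None
        offset, length = self.index[key]
        self._data.seek(offset)                # a single seek straight to the value
        return self._data.read(length)

table = ToySSTable({b'Mohsin': b'Sony', b'Kim': b'Sony', b'Peter': b'Panasonic'})
print(table.get(b'Kim'))      # b'Sony'
print(table.get(b'Ricky'))    # None, and the "disk" was never touched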

Some limitations

Pretty smart, neat, and elegant, you would say! Yes, it is. However, there's a catch. If you want to insert a new row, its key must fall into the sorted order, and even if you know exactly where the key should be placed in the file (so no re-sorting of the data is needed), you still have to rewrite the whole file in the new sorted order, along with the index. Hence, a large amount of I/O is required for just a single row insertion.

The same goes for deleting a row, because the file now has to be sorted and rewritten again. Updates are OK as long as the key itself is not altered, because changing the key amounts to having a new key altogether. A modified key would have a different place in the sorted order, depending on what the key actually is, and hence the whole file would be rewritten. For example, say you have a row with the key all-boys, and then you change the key of that row to x-rays-of-zebra. After this modification, the row ends up near the end of the file, whereas previously it was probably at the beginning, because all-boys comes before x-rays-of-zebra when sorted.

This seems pretty limiting, and it looks like inserting or removing a key is quite expensive. However, this is not the case, as we will see later.

Random writes and deletion

There’s one last thing that’s worth a mention before we examine the operations that are available on a BigTable. We’d like to examine how random writes and the deletion of rows are handled because that seems quite expensive, as we just examined in the preceding section.

The idea is very simple. All the reads, writes, and removals don't go straight to the disk. Instead, an in-memory SSTable is created along with its index, both of which are empty when created. We'll call it the MemTable from this point onwards for the sake of simplicity. Every read checks the index of this table, and if a record is found there, well and good. If it is not, the index of the SSTable on the disk is checked and the desired row is returned.

When a new row has to be written, we don't look at anything and simply enter the row in the MemTable, along with its record in the MemTable's index.

To delete a key, we simply mark it as deleted in memory, regardless of whether it is in the MemTable or in the on-disk table. The following diagram shows how rows are allocated to the MemTable:

Now, when the size of the MemTable grows beyond a certain limit, it is written to the disk as a new SSTable. Since this depends only on the size of the MemTable and of course happens much less frequently, it is much faster. Each time the MemTable grows beyond the configured size, it is flushed to the disk as a new SSTable. However, the index of each flushed SSTable is still kept in memory so that we can quickly serve incoming read requests and locate the row in any table without touching the disk.

Finally, when the number of SSTables reaches a certain count, the SSTables are merged and collapsed into a single SSTable. Since each SSTable is just a sorted set of keys, a merge sort is applied. This merging process is quite fast.
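
The whole cycle of reads, writes, deletions, flushes, and merges can be sketched in a few lines of plain Python. This is only a toy model under assumed simplifications—a made-up flush threshold and a plain deletion marker—whereas real systems use tombstones, write-ahead logs, and background compaction:

DELETED = object()                             # stand-in for a deletion marker (tombstone)
FLUSH_AT = 4                                   # hypothetical flush threshold

class ToyStore:
    def __init__(self):
        self.memtable = {}                     # in-memory writes and deletions
        self.sstables = []                     # each flushed table stands in for an on-disk SSTable

    def put(self, key, value):
        self.memtable[key] = value             # writes never touch the "disk" directly
        if len(self.memtable) >= FLUSH_AT:     # flush the MemTable as a new sorted table
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def delete(self, key):
        self.memtable[key] = DELETED           # just mark it; it is dropped at merge time

    def get(self, key):
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is DELETED else value
        for table in reversed(self.sstables):  # newest flushed table first
            if key in table:
                value = table[key]
                return None if value is DELETED else value
        return None

    def compact(self):                         # collapse all tables into one, dropping deletions
        merged = {}
        for table in self.sstables + [self.memtable]:   # older first, so newer values win
            merged.update(table)
        self.sstables = [{k: v for k, v in sorted(merged.items()) if v is not DELETED}]
        self.memtable = {}

store = ToyStore()
for i in range(5):
    store.put(b'key%d' % i, b'value%d' % i)    # the fourth put triggers a flush
store.delete(b'key1')
print(store.get(b'key1'))                      # None: masked by the deletion marker
store.compact()
print(len(store.sstables[0]))                  # 4: key1 is gone for good after the merge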

Congratulations! You've just learned about the most atomic storage unit in Big Data solutions such as BigTable, HBase, Hypertable, Cassandra, and LevelDB. That's how they actually store and process the data.

Now that we know how a big table is actually stored on the disk and how the read and writes are handled, it’s time to take a closer look at the available operations.

Operations on BigTable

Until this point, we know that a BigTable table is a collection of rows with unique keys of up to 64 KB in length, and that the data is stored according to the lexicographic sort order of the keys. We also examined how it is laid out on the disk and how reads, writes, and removals are handled. Now, the question is, which operations are available on this data? The following are the operations that are available to us:

  • Fetching a row by using its key
  • Inserting a new key
  • Deleting a row
  • Updating a row
  • Reading a range of rows from the starting row key to the ending row key

Reading

Now, the first operation is pretty simple. You have a key, and you want the associated row. Since the whole dataset is sorted by the key, all we need to do is perform a binary search on it, and we'll be able to locate the desired row within a few lookups, even in a set of a million rows. In practice, the index at the end of the SSTable is loaded into memory, and the binary search is actually performed on it. In light of what we know from the previous section, the MemTable's index is already in memory. In case there are multiple SSTables, because the MemTable was flushed to the disk many times as it grew too large, the indexes of all the SSTables are present in memory, and a quick binary search is performed on them.

Writing

The second operation that is available to us is the ability to insert a new row. So, we have a key and the values that we want to insert in the table. In light of our new knowledge about the physical storage and SSTables, we can understand this very well. The write happens directly on the in-memory MemTable, and its index, which is also in memory, is updated. Since we are writing in memory, no disk access is required, and the file on disk doesn't have to be rewritten. This operation is very fast and almost instantaneous. However, when the MemTable grows beyond a certain size, it is flushed to the disk as a new SSTable along with its index, while a copy of the index is retained in memory. Finally, we also saw that when the number of SSTables reaches a certain count, they are merged and collapsed to form a new, bigger table.

Deleting

It seems that since all the keys are in a sorted order on the disk and deleting a key would disrupt the sort order, a rewrite of the whole file would be a big I/O overhead. However, it is not, as it can be handled smartly. Since all the indexes, including that of the MemTable and those of the SSTables that resulted from flushing the MemTable to the disk, are already in memory, deleting a row only requires us to find the required key in the in-memory indexes and mark it as deleted.

Now, whenever someone tries to read the row, the in-memory indexes will be checked, and although an entry will be there, it will be marked as deleted and won't be returned. When the MemTable is being flushed to the disk or multiple tables are being collapsed, this key and the associated row will be excluded from the write process. Hence, they are totally gone from the storage.

Updating

Updating a row is no different, but it has two cases. The first case is the one in which not only the values but also the key is modified. In this case, it is like removing the row with the old key and inserting a row with the new key. We have already seen both of these operations in detail, so the process should be obvious.

However, the case where only the values are modified is even simpler. We only have to locate the row from the indexes, load it in memory if it is not already there, and modify it. That's all.

Scanning a range

This last operation is quite interesting. You can scan a range of keys from a starting key to an ending key. For instance, you can return all the rows that have a key greater than or equal to key1 and less than or equal to key2, effectively forming a range. Since looking up a single key is a fast operation, we only have to locate the first key of the range. Then, we read the consecutive keys one after the other till we encounter a key that is greater than key2, at which point we stop scanning; the rows scanned so far are our query's result. This is how it looks:

Name | Department | Company
Chris Harris | Research & Development | Google
Christopher Graham | Research & Development | LG
Debra Lee | Accounting | Sony
Ernest Morrison | Accounting | Apple
Fred Black | Research & Development | Sony
Janice Young | Research & Development | Google
Jennifer Sims | Research & Development | Panasonic
Joyce Garrett | Human Resources | Apple
Joyce Robinson | Research & Development | Apple
Judy Bishop | Human Resources | Google
Kathryn Crawford | Human Resources | Google
Kelly Bailey | Research & Development | LG
Lori Tucker | Human Resources | Sony
Nancy Campbell | Accounting | Sony
Nicole Martinez | Research & Development | LG
Norma Miller | Human Resources | Sony
Patrick Ward | Research & Development | Sony
Paula Harvey | Research & Development | LG
Stephanie Chavez | Accounting | Sony
Stephanie Mccoy | Human Resources | Panasonic

In the preceding table, we said that the starting key will be greater than or equal to Ernest and the ending key will be less than or equal to Kathryn. So, we locate the first key that is greater than or equal to Ernest, which happens to be Ernest Morrison. Then, we scan further, picking and returning each key as long as it is less than or equal to Kathryn. Judy Bishop is less than Kathryn, so it is returned. However, the next key, Kathryn Crawford, is actually greater than Kathryn (it is Kathryn followed by more bytes), so that row is not returned and the scan stops there. The rows before it are returned. This is the last operation that is available to us on BigTable.
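
Such a scan can be sketched in plain Python (a toy stand-in for the real operation), using the full names as row keys. Note how b'Kathryn Crawford' falls outside the range because it sorts after b'Kathryn':

import bisect

keys = sorted([b'Chris Harris', b'Debra Lee', b'Ernest Morrison',
               b'Judy Bishop', b'Kathryn Crawford', b'Paula Harvey'])

def scan(start, end):
    i = bisect.bisect_left(keys, start)        # one lookup to find the first key >= start
    result = []
    while i < len(keys) and keys[i] <= end:    # then read consecutive keys until one exceeds end
        result.append(keys[i])
        i += 1
    return result

print(scan(b'Ernest', b'Kathryn'))
# [b'Ernest Morrison', b'Judy Bishop']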

Selecting a key

Now that we have examined the data model and the storage layout, we are in a better position to talk about key selection for a table. Since the stored data is sorted by the key, the choice of key has little impact on writing, deleting, or updating a row, or on fetching a single row. The operation that is really impacted by the key is scanning a range.

Let's think about the previous table again and assume that this table is part of some system that processes payrolls for companies, and the companies pay us for the task of processing their payroll. Now, let's suppose that Sony asks us to process their data and generate a payroll for them. Right now, we cannot do anything of this kind. We can only make our program scan the whole table, and hence all the records (which might be in the millions), and pick just the records where the company column has the value Sony. This would be inefficient.

Instead, what we can do is put the sorted nature of row keys to our service. Select the company name as the key and concatenate the department and name along with it. So, the new table will look like this:

Key | Name | Department | Company
Apple-Accounting-Ernest Morrison | Ernest Morrison | Accounting | Apple
Apple-Human Resources-Joyce Garrett | Joyce Garrett | Human Resources | Apple
Apple-Research & Development-Joyce Robinson | Joyce Robinson | Research & Development | Apple
Google-Human Resources-Judy Bishop | Judy Bishop | Human Resources | Google
Google-Human Resources-Kathryn Crawford | Kathryn Crawford | Human Resources | Google
Google-Research & Development-Chris Harris | Chris Harris | Research & Development | Google
Google-Research & Development-Janice Young | Janice Young | Research & Development | Google
LG-Research & Development-Christopher Graham | Christopher Graham | Research & Development | LG
LG-Research & Development-Kelly Bailey | Kelly Bailey | Research & Development | LG
LG-Research & Development-Nicole Martinez | Nicole Martinez | Research & Development | LG
LG-Research & Development-Paula Harvey | Paula Harvey | Research & Development | LG
Panasonic-Human Resources-Stephanie Mccoy | Stephanie Mccoy | Human Resources | Panasonic
Panasonic-Research & Development-Jennifer Sims | Jennifer Sims | Research & Development | Panasonic
Sony-Accounting-Debra Lee | Debra Lee | Accounting | Sony
Sony-Accounting-Nancy Campbell | Nancy Campbell | Accounting | Sony
Sony-Accounting-Stephanie Chavez | Stephanie Chavez | Accounting | Sony
Sony-Human Resources-Lori Tucker | Lori Tucker | Human Resources | Sony
Sony-Human Resources-Norma Miller | Norma Miller | Human Resources | Sony
Sony-Research & Development-Fred Black | Fred Black | Research & Development | Sony
Sony-Research & Development-Patrick Ward | Patrick Ward | Research & Development | Sony

So, this is the new format. We just welded the company, department, and name together as the key, and since the table is always sorted by the key, it looks like the preceding table. Now, suppose that we receive a request from Google to process their data. All we have to do is perform a scan, starting from the key greater than or equal to Google and less than L, because that's the next letter. This scan covers exactly the Google rows in the previous table.

Now, the next request is more specific. Sony asks us to process their data, but only for their accounting department. How do we do that? Quite simple! In this case, our starting key will be greater than or equal to Sony-Accounting, and the ending key can be Sony-Accountinga, where a is appended to mark the end of the range. The scanned range corresponds to the Sony-Accounting rows in the previous table.
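
Continuing the toy scan from earlier in plain Python (again a stand-in for the real operation, with just a handful of the composite keys), the end key is simply the prefix with a character appended that sorts after the '-' separator:

keys = sorted([b'Google-Research & Development-Chris Harris',
               b'Sony-Accounting-Debra Lee',
               b'Sony-Accounting-Nancy Campbell',
               b'Sony-Human Resources-Lori Tucker'])

def scan(start, end):
    return [k for k in keys if start <= k <= end]      # stands in for the real range scan

print(scan(b'Google', b'L'))                           # everything for Google
# [b'Google-Research & Development-Chris Harris']
print(scan(b'Sony-Accounting', b'Sony-Accountinga'))   # only Sony's accounting department
# [b'Sony-Accounting-Debra Lee', b'Sony-Accounting-Nancy Campbell']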

BigTable – a hands-on approach

Okay, enough of the theory. It is now time to take a break and perform some hands-on experimentation. By now, we know about 80 percent of BigTable; the other 20 percent of the complexity lies in scaling it to more than one machine. Our discussion so far has assumed and focused on a single-machine environment, as if the BigTable table were sitting on our laptop, and that's about it.

You might really want to experiment with what you learned. Fortunately, given that you have the latest version of Google Chrome or Mozilla Firefox, that’s easy. You have BigTable right there! How? Let me explain.

Basically, the ideas that we looked at—the sorted key-value storage, the sorted layout of files, the indexes over the sorted files, and all the operations performed on them, including scanning—were extracted into a separate component called LevelDB. Meanwhile, as HTML was evolving towards HTML5, a need was felt to store data locally. Initially, SQLite3 was embedded in browsers, and there was a querying interface for you to play with. So, all in all, you had an SQL database in the browser, which opened up a lot of possibilities. However, W3C later deprecated this specification and urged browser vendors not to implement it.

Instead of web databases based on SQLite3, browsers now have databases based on LevelDB (Chrome's IndexedDB, for instance), which are actually key-value stores where the storage is always sorted by key. Hence, besides looking up a single key, you can scan across a range of keys.

Covering the IndexedDB API here would be beyond the scope of this book, but if you want to understand it and see what the theory we talked about looks like in practice, you can try using IndexedDB in your browser by visiting http://code.tutsplus.com/tutorials/working-with-indexeddb--net-34673.

The concepts of keys and the scanning of key ranges are exactly like those that we examined here for BigTable, and the concepts about indexes correspond mainly to those that we will examine in a later section about the datastore.

Scaling BigTable to BigData

By now, you have probably understood the data model of BigTable, how it is laid out on the disk, and the advantages it offers. To recap once again, a BigTable installation may have many tables, each table may have many column families that are defined at the time of creating the table, and each column family may have many columns, as required. Rows are identified by keys, which have a maximum length of 64 KB, and the stored data is sorted by the key. We can retrieve, update, and delete a single row. We can also scan a range of rows from a starting key to an ending key.

So now, the question comes, how does this scale? We will provide a very high-level overview, neglecting the micro details to keep things simple and build a mental model that is useful to us as the consumers of BigTable, as we’re not supposed to clone BigTable’s implementation after all.

As we saw earlier, the basic storage unit in BigTable is a file format called SSTable, which stores key-value pairs sorted by the key and has an index at its end. We also examined how reads, writes, and deletes work on an in-memory copy of the table that is periodically merged with the tables present on the disk. Lastly, we also mentioned that when the in-memory table is flushed to the disk as SSTables and these reach a certain configurable count, they are merged into a bigger table.

The view so far presents the data model, its physical layout, and how operations work on it in cases where the data resides on a single machine, such as a situation where your laptop has a telephone directory of the whole of Europe.

However, how does that work at larger scales? Neglecting the minor implementation details and complexities that arise in distributed systems, the overall architecture and working principles are simple. In the case of a single machine, there's only one SSTable file (or a few, in case they have not been merged into one) that has to be taken care of, and all the operations are performed on it. However, if this file does not fit on a single machine, we will of course have to add another machine, and half of the SSTable will reside on one machine, while the other half will be on the other machine.

This split would of course mean that each machine would have a range of keys. For instance, if we have 1 million keys (that look like key1, key2, key3, and so on), then the keys from key1 to key500000 might be on one machine, while the keys from key500001 to key1000000 will be on the second machine. So, we can say that each machine has a different key range for the same table. Now, although the data resides on two different machines, it is of course a single table that sprawls over the two machines. These partitions or separate parts are called tablets. The following figure shows the key allocation on the two machines:

We will keep this system to only two machines and 1 million rows for the sake of discussion, but there may be cases where there are about 20 billion keys sprawling over some 12,000 machines, with each machine having a different range of keys. However, let’s continue with this small cluster consisting of only two nodes.

Now, the problem is that as an external user who has no knowledge of which machine has which portion of the SSTable (and eventually, the key ranges on each machine), how can a key, say, key489087 be located?

For this, we will have to add something like a telephone directory, where we look up the table name and the desired key, and find out which machine we should contact to get the data associated with that key. So, we are going to add another node, which will be called the master. This master will again contain a simple, plain SSTable, which is familiar to us. However, its key-value pairs will be very interesting ones. Since this table contains data about the other BigTable tables, let's call it the METADATA table. In the METADATA table, we will adopt the following format for the keys:

  • tablename_ending-row-key

Since we have only two machines and each machine holds one tablet, the METADATA table will look like this:

Key | Value
employees_key500000 | 192.168.0.2
employees_key1000000 | 192.168.0.3

The master stores the location of each tablet server against a row key that is an encoding of the table name and the ending row key of the tablet; so, to locate a tablet, this table has to be scanned. The master assigns tablets to different machines when required. Each tablet is about 100 MB to 200 MB in size. So, if we want to fetch a key, all we need to know is the following:

  • Location of the master server
  • Table in which we are looking for the key
  • The key itself

Now, we will concatenate the table name with the key and perform a scan on the METADATA table on the master node. Let's suppose that we are looking for key600000 in the employees table. So, we would actually first be looking for the employees_key600000 key in the table on the master machine. As you are familiar with the scan operation on an SSTable (and METADATA is just an SSTable), we look for the first key that is greater than or equal to employees_key600000, which happens to be employees_key1000000. Against this key, the IP address 192.168.0.3 is listed, which means that this is the machine that we should connect to in order to fetch our data. We talk about keys rather than a single key because this is a range scan operation, which will become clearer with another example.

Let's suppose that we want to process rows with keys starting from key400000 to key800000. Now, if you look at the distribution of data across the machines, you'll know that half of the required range is on one machine, while the other half is on the other. In this case, when we consult the METADATA table, two rows are returned to us, because key400000 is less than key500000 (the ending row key for the data on the first machine) and key800000 is less than key1000000, which is the ending row key for the data on the second machine. With these two rows returned, we have two locations to fetch our data from.
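
As a rough plain-Python sketch of this lookup (with zero-padded keys such as key0500000, so that lexicographic order matches the numeric order assumed in the example; the IP addresses are the made-up ones from the text):

import bisect

metadata = [                                        # METADATA is itself just a sorted table
    (b'employees_key0500000', '192.168.0.2'),       # tablet ending at key0500000
    (b'employees_key1000000', '192.168.0.3'),       # tablet ending at key1000000
]
meta_keys = [k for k, _ in metadata]

def tablets_for_range(table, start_key, end_key):
    start = table + b'_' + start_key
    end = table + b'_' + end_key
    i = bisect.bisect_left(meta_keys, start)        # first tablet whose ending key >= start
    servers = []
    while i < len(meta_keys):
        servers.append(metadata[i][1])
        if meta_keys[i] >= end:                     # this tablet already covers the end of the range
            break
        i += 1
    return servers

print(tablets_for_range(b'employees', b'key0600000', b'key0600000'))  # ['192.168.0.3']
print(tablets_for_range(b'employees', b'key0400000', b'key0800000'))  # ['192.168.0.2', '192.168.0.3']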

This leads to an interesting side-effect. As the data resides on two different machines, this can be read or processed in parallel, which leads to an improved system performance. This is one reason why even with larger datasets, the performance of BigTable won’t deteriorate as badly as it would have if it were a single, large machine with all the data on it.

The datastore thyself

So, until now, everything that we talked about was BigTable, and we did not mention the datastore at all. Now is the time to look at the datastore in detail, because we understand BigTable quite well by now. The datastore is effectively a solution built on top of BigTable as a persistent NoSQL layer for Google App Engine.

As we know, a BigTable installation might have many different tables. The datastore stores the data of all the applications in six separate tables, where each table stores a different aspect of, or information about, the data. Don't worry about memorizing things about data modeling and how to use it for now, as this is something that we are going to look into in greater detail later.

The fundamental unit of storage in datastore is called a property. You can think of a property as a column. So, a property has a name and type. You can group multiple properties into a Kind, which effectively is a Python class and analogous to a table in the RDBMS world. Here’s a pseudo code sample:

# 1. Define our Kind and how it looks like.
class Person(object):
   name = StringProperty()
   age = IntegerProperty()

# 2. Create entities of the Person Kind
ali = Person(name='Ali', age=24)
bob = Person(name='Bob', age=34)
david = Person(name='David', age=44)
zain = Person(name='Zain', age=54)
# 3. Save them
ali.put()
bob.put()
david.put()
zain.put()

This looks a lot like an ORM such as Django's ORM, SQLAlchemy, or Rails' ActiveRecord. So, the Person class is called a Kind in App Engine's terminology. The StringProperty and IntegerProperty property classes are used to indicate the type of data that is supposed to be stored. We created instances of the Person class, such as ali. Each such instance is called an entity in App Engine's terminology. Each entity, when stored, has a key that is not only unique throughout your application but, when combined with your application ID, becomes unique throughout all the applications that are hosted on Google App Engine.

All entities of all kinds for all apps are stored in a single BigTable, and they are stored in a way where all the property values are serialized and stored in a single BigTable column. Hence, no separate columns are defined for each property.

This is interesting, and required as well, because if we were Google App Engine's architects, we would not know what kind of data people are going to store or the number and types of properties that they would define, so it makes sense to serialize the whole thing as one value and store it in one column.

So, this is how it looks:

Key | Kind | Data
agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBbGkM | Person | {name: 'Ali', age: 24}
agtkZXZ-bWdhZS0wMXIPCxNTVVyc29uIgNBbGkM | Person | {name: 'Bob', age: 34}
agtkZXZ-bWdhZS0wMXIPCxIGUGVyc29uIgNBbBQM | Person | {name: 'David', age: 44}
agtkZXZ-bWdhZS0wMXIPCxIGUGVyc29uIRJ3bGkM | Person | {name: 'Zain', age: 54}

The key appears to be random, but it is not. A key is formed by concatenating your application ID, your Kind name (Person here), and either a unique identifier that is auto generated by Google App Engine, or a string that is supplied by you.

The key appears cryptic, but it is not encrypted, so it is not safe to pass it around in public, as someone might decode it and take advantage of it. Basically, it is just base64 encoded and can easily be decoded to reveal the entity's Kind name and ID. A better way would be to encrypt it using a secret key before passing it around in public and, on receiving it back, to decrypt it using the same key. A gist that can serve this purpose is available on GitHub at https://gist.github.com/mohsinhijazee/07cdfc2826a565b50a68. However, for it to work, you need to edit your app.yaml file so that it includes the following:

libraries:
- name: pycrypto
   version: latest

Then, you can call the encrypt() function on the key before passing it around and decrypt it back using the decrypt() function, as follows:

person = Person(name='peter', age=10)
key = person.put()
url_safe_key = key.urlsafe()
safe_to_pass_around = encrypt(SECRET_KEY, url_safe_key)

Now, when you have a key from the outside, you should first decrypt it and then use it, as follows:

key_from_outside = request.params.get('key')
url_safe_key = decrypt(SECRET_KEY, key_from_outside)
key = ndb.Key(urlsafe=url_safe_key)
person = key.get()

The key object is now good for use. To summarize, just get the URL safe key by calling the ndb.Key.urlsafe() method and encrypt it so that it can be passed around. On return, just do the reverse.

If you really want to see how the encrypt and decrypt operations are implemented, they are reproduced as follows without any documentation/comments, as cryptography is not our main subject:

import os
import base64
from Crypto.Cipher import AES

BLOCK_SIZE = 32
PADDING='#'

def _pad(data, pad_with=PADDING):
   return data + (BLOCK_SIZE - len(data) % BLOCK_SIZE) * pad_with

def encrypt(secret_key, data):
   cipher = AES.new(_pad(secret_key, '@')[:32])
   return base64.b64encode(cipher.encrypt(_pad(data)))

def decrypt(secret_key, encrypted_data):
   cipher = AES.new(_pad(secret_key, '@')[:32])
   return cipher.decrypt(base64.b64decode
     (encrypted_data)).rstrip(PADDING)

KEY='your-key-super-duper-secret-key-here-only-first-32-characters-are-used'
encrypted = encrypt(KEY, 'Hello, world!')
print encrypted
print decrypt(KEY, encrypted)

More explanation on how this works is given at https://gist.github.com/mohsinhijazee/07cdfc2826a565b50a68.

Now, let’s come back to our main subject, datastore. As you can see, all the data is stored in a single column, and if we want to query something, for instance, people who are older than 25, we have no way to do this. So, how will this work? Let’s examine this next.

Supporting queries

Now, what if we want to get information pertaining to all the people who are older than, say, 30? In the current scheme of things, this does not seem doable, because the data is serialized and dumped, as shown in the previous table. The datastore solves this problem by putting the values that are to be queried on into the row keys of another, sorted table. So here, we want to query by age. The datastore will create a record in another table called the index table. This index table is nothing but a plain BigTable table, where the row keys encode the property value that you want to query on. Hence, a quick lookup and a range scan are possible. Here's how it would look:

Key | Entity key
Myapp-person-age-24 | agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBbGkM
Myapp-person-age-34 | agtkZXZ-bWdhZS0wMXIPCxNTVVyc29uIgNBbGkM
Myapp-person-age-44 | agtkZXZ-bWdhZS0wMXIPCxIGUGVyc29uIgNBbBQM
Myapp-person-age-54 | agtkZXZ-bWdhZS0wMXIPCxIGUGVyc29uIRJ3bGkM
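
A toy version of such a query in plain Python: the key layout below mirrors the table above (app-kind-property-value), except that the ages are zero-padded—my addition—so that lexicographic order matches numeric order, and the entity keys are simplified placeholders:

entities = {
    'key-ali':  {'name': 'Ali',  'age': 24},
    'key-bob':  {'name': 'Bob',  'age': 34},
    'key-zain': {'name': 'Zain', 'age': 54},
}
# The index table: property value encoded into the row key, entity key as the value.
index = sorted(('myapp-Person-age-%03d' % e['age'], k) for k, e in entities.items())

def older_than(age):
    start = 'myapp-Person-age-%03d' % (age + 1)                # first possible index key for age+1
    return [entities[k] for idx, k in index if idx >= start]   # range scan, then fetch by entity key

print(older_than(30))
# [{'name': 'Bob', 'age': 34}, {'name': 'Zain', 'age': 54}]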

Implementation details

So, all in all, Datastore actually builds a NoSQL solution on top of BigTable by using the following six tables:

  • A table to store entities
  • A table to store entities by kind
  • A table to store indexes for the property values in the ascending order
  • A table to store indexes for the property values in the descending order
  • A table to store indexes for multiple properties together
  • A table to keep track of the next unique ID for each Kind

Let us look at each table in turn. The first table is used to store entities for all the applications. We have examined this in an example.

The second table stores entity keys by Kind. Nothing fancy here—it's just some metadata that the datastore maintains for itself. Think of this: you want to get all the entities of the Person Kind. How will you do this? If you look at the entities table alone and the operations that are available on a BigTable table, you will see that there's no way to fetch all the entities of a certain Kind. This table does exactly that. It looks like this:

Key | Entity key
Myapp-Person-agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBbGkM | agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBbGkM
Myapp-Person-agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBb854 | agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBb854
Myapp-Person-agtkZXZ-bWdhZS0wMXIQTXIGUGVy748IgNBbGkM | agtkZXZ-bWdhZS0wMXIQTXIGUGVy748IgNBbGkM

So, as you can see, this is just a simple BigTable table where the keys are of the [app ID]-[Kind name]-[entity key] pattern.

Tables 3, 4, and 5 from the six tables that were mentioned in the preceding list are similar to the index table that we examined in the Supporting queries section.

This leaves us with the last table. As you know, while storing entities, it is important to have a unique key for each row. Since all the entities from all the apps are stored in a single table, the keys should be unique across the whole table. When the datastore generates a key for an entity that has to be stored, it combines your application ID and the Kind name of the entity. Now, this part of the key only makes it unique relative to the entities of other applications; it does not make it unique within the set of your own entities. To achieve that, you need a number appended to it. This is exactly how AUTO INCREMENT works in the RDBMS world, where the value of a column is automatically incremented to ensure that it is unique. So, that's exactly what the last table is for. It keeps track of the next ID to be used by each Kind of each application, and it looks like this:

Key | Next ID
Myapp-Person | 65

So, in this table, the key is of the [application ID]-[Kind name] format, and the value is the next ID, which is 65 in this particular case. When a new entity of the Person Kind is created, it will be assigned 65 as its ID, and the row will get a new value of 66. Our application has only one Kind defined, which is Person; therefore, there's only one row in this table, because we are only keeping track of the next ID for this one Kind. If we had another Kind, say, Group, it would have its own row in this table.
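
A toy version of this allocation in plain Python, assuming keys of the form [application ID]-[Kind name] (a real implementation would also need the read-increment-write to be atomic):

next_ids = {'myapp-Person': 65}                # the ID-allocation table

def allocate_id(app_id, kind):
    key = '%s-%s' % (app_id, kind)
    assigned = next_ids.setdefault(key, 1)     # assumption: the first entity of a new Kind gets ID 1
    next_ids[key] = assigned + 1               # bump the stored value for the next entity
    return assigned

print(allocate_id('myapp', 'Person'))          # 65
print(allocate_id('myapp', 'Person'))          # 66
print(next_ids)                                # {'myapp-Person': 67}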

Summary

We started this article with the problem of storing huge amounts of data, processing it in bulk, and randomly accessing it. This arose from the fact that we were ambitious enough to want to store every single web page on earth and process it to extract some results from it. We introduced a solution called BigTable and examined its data model. We saw that in BigTable, we can define multiple tables, with each table having multiple column families, which are defined at the time of creating the table. We learned that column families are logical groupings of columns, and new columns can be defined in a column family as needed. We also learned that the data stored in BigTable has no meaning of its own; it is stored as plain bytes, and its interpretation and meaning depend on the user of the data. We also learned that each row in BigTable has a unique row key, which can be up to 64 KB in length.

Lastly, we turned our attention to the datastore, a NoSQL storage solution built on top of BigTable for Google App Engine. We briefly mentioned some datastore terminology such as properties (columns), entities (rows), and Kinds (tables). We learned that all the data is stored across six different BigTable tables, each capturing a different aspect of the data. Most importantly, we learned that all the entities of all the apps hosted on Google App Engine are stored in a single BigTable, and all the properties go into a single BigTable column. We also learned how querying is supported by additional index tables that are keyed by property values and list the corresponding entity keys.

This concludes our discussion of Google App Engine's datastore and its underlying technology, workings, and related concepts. Next, we will learn how to model our data on top of the datastore. What we learned in this article will help us enormously in understanding how to better model our data to take full advantage of the underlying mechanisms.
