6 min read

(For more resources on SimpleDB, see here.)

We have all used a Relational Database Management System (RDBMS) at some point in our careers. These relational databases are ubiquitous and are available from a wide range of companies such as Oracle, Microsoft, IBM, and so on. These databases have served us well for our application needs. However, there is a new breed of applications coming to the forefront in the current Internet-driven and socially networked economy. The new applications require large scaling to meet demand peaks that can quickly reach massive levels. This is a scenario that is hard to satisfy using a traditional relational database, as it is impossible to requisition and provision the hardware and software resources that will be needed to service the demand peaks. It is also non-trivial and difficult to scale a normal RDBMS to hundreds or thousands of nodes. The overwhelming complexity of doing this makes the RDBMS not viable for these kinds of applications. SimpleDB provides a great alternative to an RDBMS and can provide a solution to all these problems. However, in order to provide this solution, SimpleDB makes some choices and design decisions that you need to understand in order to make an informed choice about the data storage for your application domain.

No normalization

Normalization is a process of organizing data efficiently in a relational database by eliminating redundant data, while at the same time ensuring that the data dependencies make sense. SimpleDB data models do not conform to any of the normalization forms, and tend to be completely de-normalized. The lack of need for normalization in SimpleDB allows you a great deal of flexibility with your model, and enables you to use the power of multi-valued attributes in your data.

Let’s look at a simple example of a database starting with a basic spreadsheet structure and then design it for an RDBMS and a SimpleDB. In this example, we will create a simple contact database, with contact information as raw data.

ID

First_Name

Last_Name

Phone_Num

101

John

Smith

555-845-7854

101

John

Smith

555-854-9885

101

John

Smith

555-695-7485

102

Bill

Jones

555-748-7854

102

Bill

Jones

555-874-8654

The obvious issue is the repetition of the name data. The table is inefficient and would require care to update to keep the name data in sync. To find a person by his or her phone number is easy.

SELECT * FROM Contact_Info WHERE Phone_Num = '555-854-9885'

So let’s analyze the strengths and weaknesses of this database design.

SCORE-Raw data

Strength

Weakness

Efficient storage

 

No

Efficient search by phone number

Yes

 

Efficient search by name

 

No

Easy-to-add another phone number

Yes

 

The design is simple, but as the name data is repeated, it would require care to keep the data in sync. Searching for phone numbers by name would be ugly if the names got out of sync.

To improve the design, we can rationalize the data. One approach would be to create multiple phone number fields such as the following. While this is a simple solution, it does limit the phone numbers to three. Add e-mail and Twitter, and the table becomes wider and wider.

ID

First_Name

Last_Name

Phone_Num_1

Phone_Num_2

Phone_Num_3

101

John

Smith

555-845-7854

555-854-9885

555-695-7485

102

Bill

Jones

555-748-7854

555-874-8654

 

Finding a person by a phone number is ugly.

SELECT * FROM Contact_Info WHERE Phone_Num_1 = '555-854-9885'
OR Phone_Num_2 = '555-854-9885'
OR Phone_Num_3 = '555-854-9885'

Now let’s analyze the strengths and weaknesses of this database design.

SCORE-Rationalize data

Strength

Weakness

Efficient storage

Yes

 

Efficient search by phone number

 

No

Efficient search by name

Yes

 

Easy to add another phone number

 

No

The design is simple, but the phone numbers are limited to three, and searching by phone number involves three index searches.

Another approach would be to use a delimited list for the phone number as follows:

ID

First_Name

Last_Name

Phone_Nums

101

John

Smith

555-845-7854;555-854-9885;555-695-7485

102

Bill

Jones

555-748-7854;555-874-8654

This approach has the advantage of no data repetition and is easy to maintain, compact, and extendable, but the only way to find a record by the phone number is with a substring search.

SELECT * FROM Contact_Info WHERE Phone_Nums LIKE %555-854-9885%

This type of SQL forces a complete table scan. Do this with a small table and no one will notice, but try this on a large database with millions of records, and the performance of the database will suffer.

SCORE-Delimited Data

Strength

Weakness

Efficient storage

Yes

 

Efficient search by phone number

 

No

Efficient search by name

Yes

 

Easy to add another phone number

Yes

 

A delimited field is good for data that is of one type and will only be retrieved.

The normalization for relational databases results in splitting up your data into separate tables that are related to one another by keys. A join is an operation that allows you to retrieve the data back easily across the multiple tables.

Let’s first normalize this data.

This is the Person_Info table:

ID

First_Name

Last_Name

101

John

Smith

102

Bill

Jones

And this is the Phone_Info table:

ID

Phone_Num

101

555-845-7854

101

555-854-9885

101

555-695-7485

102

555-748-7854

102

555-874-8654

Now a join of the Person_Info table with the Phone_Info can retrieve the list of phone numbers as well as the e-mail addresses. The table structure is clean and other than the ID primary key, no data is duplicated. Provided Phone_Num is indexed, retrieving a contact by the phone number is efficient.

SELECT First_Name, Last_Name, Phone_num, Person_Info.ID
FROM Person_Info JOIN Phone_Info
ON Person_Info.ID = Phone_Info.ID
WHERE Phone_Num = '555-854-9885'

So if we analyze the strengths and weaknesses of this database design, we get:

SCORE-Relational Data

Strength

Weakness

Efficient storage

Yes

 

Efficient search by phone number

Yes

 

Efficient search by name

Yes

 

Easy to add another phone number

Yes

 

While this is an efficient relational model, there is no join command in SimpleDB. Using two tables would force two selects to retrieve the complete contact information. Let’s look at how this would be done using the SimpleDB principles.

No joins

SimpleDB does not support the concept of joins. Instead, SimpleDB provides you with the ability to store multiple values for an attribute, thus avoiding the necessity to perform a join to retrieve all the values.

ID

 

 

 

101

First_Name=John

Last_Name=Smith

Phone_Num =555-845-7854
Phone_Num =555-854-9885
Phone_Num =555-695-7485

102

First_Name=Bill

Last_Name=Jones

Phone_Num =555-748-7854
Phone_Num =555-874-8654

In the SimpleDB table, each record is stored as an item with attribute/value pairs. The difference here is that the Phone_Num field has multiple values. Unlike a delimited list field, SimpleDB indexes all values enabling an efficient search each value.

SELECT * FROM Contact_Info WHERE Phone_Num = '555-854-9885'

This SELECT is very quick and efficient. It is even possible to use Phone_Num multiple times such as follows:

SELECT * FROM Contact_Info WHERE Phone_Num = '555-854-9885'
OR Phone_Num = '555-748-7854'

Let’s analyze the strengths and weaknesses of this approach:

SCORE-SimpleDB Data

Strength

Weakness

Efficient storage

Yes

 

Efficient search by phone number

Yes

 

Efficient search by name

Yes

 

Easy to add another phone number

Yes

 

LEAVE A REPLY

Please enter your comment!
Please enter your name here