When it comes to designing a data warehouse, one option makes the most sense for structuring our database: the dimensional model. It looks at the data from a business perspective, which makes the data simple, understandable, and easy for the business end user to query. Retrieving data from it doesn’t require a database administrator.
A normalized model, by contrast, removes redundancies in data by storing information in discrete tables and then referencing those tables when needed. This is an advantage for a transactional system because each piece of information needs to be entered in only one place in the database, without duplicating anything already entered. For example, in the ACME Toys and Gizmos transactional database, each time the sale of an item is recorded at a register, a record needs to be added only to the transactions table. The details that identify the register, the item, and the employee who processed the transaction do not need to be entered again because that information is already stored in separate tables. The transaction record just needs to be entered with references to all that other information.
This works extremely well for a transactional type of system concerned with daily operational processing where the focus is on getting data into the system. However, it does not work well for a data warehouse whose focus is on getting data out of the system. Users do not want to navigate through the spider web of tables that compose a normalized database model to extract the information they need. Therefore, dimensional models were introduced to provide the end user with a flattened structure of easily queried tables that he or she can understand from a business perspective.
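To make the trade-off concrete, here is a minimal sketch of a normalized transactional layout using Python’s built-in sqlite3 module. The table and column names (registers, items, employees, transactions) are assumptions made for illustration, not the actual ACME schema. Recording a sale touches one table, but reading it back in business terms already takes three joins.

```python
import sqlite3

# Hypothetical normalized tables; every name here is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE registers (register_id INTEGER PRIMARY KEY, store_name TEXT);
    CREATE TABLE items     (item_id INTEGER PRIMARY KEY, description TEXT, price REAL);
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        register_id    INTEGER REFERENCES registers(register_id),
        item_id        INTEGER REFERENCES items(item_id),
        employee_id    INTEGER REFERENCES employees(employee_id),
        quantity       INTEGER
    );
    INSERT INTO registers VALUES (1, 'Main Street');
    INSERT INTO items     VALUES (10, 'Gizmo', 4.5);
    INSERT INTO employees VALUES (100, 'Pat');
""")

# Getting data in: one row in one table, holding only references.
conn.execute(
    "INSERT INTO transactions (register_id, item_id, employee_id, quantity) "
    "VALUES (?, ?, ?, ?)",
    (1, 10, 100, 2),
)

# Getting data out: the user has to join every referenced table.
row = conn.execute("""
    SELECT r.store_name, i.description, e.name, t.quantity * i.price
    FROM transactions t
    JOIN registers r ON r.register_id = t.register_id
    JOIN items     i ON i.item_id     = t.item_id
    JOIN employees e ON e.employee_id = t.employee_id
""").fetchone()
print(row)
```

A real transactional schema would typically have many more tables than this, which is where the spider web comes from.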
A dimensional model takes the business rules of our organization and represents them in the database in a more understandable way. A business manager looking at sales data is naturally going to think more along the lines of “How many gizmos did I sell last month in all stores in the south and how does that compare to how many I sold in the same month last year?” Managers just want to know what the result is, and don’t want to worry about how many tables need to be joined in a complex query to get that result. A dimensional model removes the complexity and represents the data in a way that end users can relate to more easily from a business perspective.
Users can intuitively think of the data for the above question as a cube, with the edges (or dimensions) of the cube labeled as stores, products, and time frame. So let’s take a look at this concept of a cube with dimensions, and how we can use that to represent our data.
Cube and dimensions
The dimensions become the business characteristics about the sales, for example:
- A time dimension—users can look back in time and perform time series analysis, such as how a quarter compares to the same quarter last year
- A store dimension—information can be retrieved by store and location
- A product dimension—various products for sale can be broken out
Think of the dimensions as the edges of a cube, and the intersection of the dimensions as the measure we are interested in for that particular combination of time, store, and product. A picture is worth a thousand words, so let’s look at what we’re talking about in the following image:
Notice what this cube looks like. How about a Rubik’s Cube? We’re doing a data warehouse for a toy store company, so we ought to know what a Rubik’s Cube is! If you have one, maybe you should go get it now because that will exactly model what we’re talking about. Think of the width of the cube, or a row going across, as the product dimension. Every piece of information or measure in the same row refers to the same product, so there are as many rows in the cube as there are products. Think of the height of the cube, or a column going up and down, as the store dimension. Every piece of information in a column represents one single store, so there are as many columns as there are stores. Finally, think of the depth of the cube as the time dimension, so any piece of information in the rows and columns at the same depth represents the same point in time. The intersection of these three dimensions locates a single individual cube in the big cube, and that represents the measure amount we’re interested in. In this case, it’s dollar sales for a single product in a single store at a single point in time.
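The cube can be sketched in a few lines of Python as a mapping from a (product, store, month) coordinate to a dollar amount. The sample figures and names below are invented for illustration. Each key locates one small cube inside the big one, and fixing one coordinate while summing over the others gives a slice.

```python
from collections import defaultdict

# Hypothetical sales records: (product, store, month, dollar_sales).
sales = [
    ("gizmo", "north", "2009-01", 120.0),
    ("gizmo", "south", "2009-01", 80.0),
    ("robot", "south", "2009-01", 200.0),
    ("gizmo", "south", "2009-02", 95.0),
]

# The cube: one cell per combination of product, store, and month.
cube = defaultdict(float)
for product, store, month, dollars in sales:
    cube[(product, store, month)] += dollars

# One cell: dollar sales for a single product, store, and point in time.
cell = cube[("gizmo", "south", "2009-01")]

# One slice: fix the product and add up every store and month.
gizmo_total = sum(value for (p, s, m), value in cube.items() if p == "gizmo")

print(cell, gizmo_total)
```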
But one might wonder if we are restricted to just three dimensions with this model. After all, a cube has only three dimensions: width, height, and depth. Well, the answer is no. We can have many more dimensions than just three. In our ACME example, we might want to know the sales each employee has accomplished for the day. This would mean we would need a fourth dimension for employees. But what about our visualization above using a cube? How is this fourth dimension going to be modeled? And no, the answer is not that we’re entering the Twilight Zone here with that “dimension not only of sight and sound but of mind…” We can think of additional dimensions as being cubes within a cube. If we think of an individual intersection of the three dimensions of the cube as being another cube, we can see that we’ve just opened up another three dimensions to use, namely the three for that inner cube. The Rubik’s Cube example used above is good because it is literally a cube of cubes and illustrates exactly what we’re talking about.
We do not need to model additional cubes. The concept of cubes within cubes was just to provide a way to visualize further dimensions. We just model our main cube, add as many dimensions as we need to describe the measures, and leave it for the implementation to handle.
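In code, adding a dimension is just one more attribute on each fact, with no nested cubes required. The sketch below assumes a hypothetical employee dimension and a helper named roll_up, both invented for this illustration. It totals the measure over any subset of the dimensions.

```python
from collections import defaultdict

# Four dimensions instead of three; the names are illustrative assumptions.
DIMENSIONS = ("product", "store", "month", "employee")

facts = [
    {"product": "gizmo", "store": "south", "month": "2009-01", "employee": "pat", "dollars": 50.0},
    {"product": "gizmo", "store": "south", "month": "2009-01", "employee": "lee", "dollars": 30.0},
    {"product": "robot", "store": "north", "month": "2009-01", "employee": "pat", "dollars": 200.0},
]

def roll_up(facts, by):
    """Total the dollars measure, grouped by the given dimension names."""
    unknown = set(by) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[name] for name in by)
        totals[key] += fact["dollars"]
    return dict(totals)

by_employee = roll_up(facts, by=("employee",))
by_product_store = roll_up(facts, by=("product", "store"))
print(by_employee)
print(by_product_store)
```

The same function works unchanged if a fifth or sixth dimension is added to each fact, which is the sense in which the implementation handles the extra dimensions for us.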
This is a very intuitive way for users to look at the design of the data warehouse. When it’s implemented in a database, it becomes easy for users to query the information from it.
Implementation of a dimensional model in a database
We have seen how a dimensional model is preferred over a normalized model for designing a data warehouse. Now before we finalize our model for the ACME Toys and Gizmos data warehouse, let’s look at the implementation of the model to see how it gets physically represented in the database. There are two options: a relational implementation and a multidimensional implementation. The relational implementation, which is the most common for a data warehouse structure, is implemented in the database with tables and foreign keys. The multidimensional implementation requires a special feature in a database that allows defining cubes directly as objects in the database. Let’s discuss a few more details of these two implementations.
Relational implementation (star schema)
This implementation is called relational because it uses the ordinary tables of a relational database, and those tables relate to each other through keys. We can’t have a POS transaction without the corresponding register it was processed on, so those two relate to each other when represented in the database as tables.
For a relational data warehouse design, the relational characteristics are retained between tables, but a design principle is followed to keep the number of levels of foreign key relationships to a minimum. Queries are much faster and easier to understand if they don’t have to include multiple levels of referenced tables. For this reason, a data warehouse dimensional design that is represented relationally in the database will have one main table to hold the primary facts, or measures, we want to store, such as count of items sold or dollar amount of sales. The descriptive information that places those measures in context is held in separate tables, which the main table references using foreign keys. The important principle here is that the tables referenced by the main table contain all the information they need and do not need to go down any more levels to reference any other tables.
The ER diagram of such an implementation would be shaped somewhat like a star, and thus the term star schema is used to refer to this kind of an implementation. The main table in the middle is referred to as the fact table because it holds the facts, or measures that we are interested in about our organization. This represents the cube that we discussed earlier. The tables surrounding the fact table are known as dimension tables. These are the dimensions of our cube. These tables contain descriptive information, which places the facts in a context that makes them understandable. We can’t have a dollar amount of sales that means much to us unless we know what item it was for, or what store made the sale, or any of a number of other pieces of descriptive information that we might want to know about it.
It is the job of data warehouse design to determine what pieces of information need to be included. We’ll then design dimension tables to hold the information. Using the dimensions we referred to above in our cube discussion as our dimension tables, we have the following diagram that illustrates a star schema:
Of course our star only has three points, but with a much larger data warehouse of many more dimensions, it would be even more star-like. Keep in mind the principle that we want to follow here of not using any more than one level of foreign key referencing. As a result, we are going to end up with a de-normalized database structure. For a data warehouse, query time and simplicity are more important than avoiding duplication of data. As for data accuracy, it’s a read-only database, so we can take care of that up front when we load the data. For these reasons, we will want to include all the information we need right in the dimension tables, rather than create further levels of foreign key references. This is the opposite of normalization, and thus the term de-normalized is used.
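A three-point star of this kind might look like the following in SQLite, driven from Python. All table names, column names, and sample rows are assumptions made for this sketch, not the schema the ACME project ends up with. The query answers a question like the manager’s earlier one: gizmos sold in southern stores, month by month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive context, one level deep.
    CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, sales_month TEXT);
    CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);

    -- Fact table: the measures plus one foreign key per dimension.
    CREATE TABLE sales_fact (
        date_key     INTEGER REFERENCES date_dim(date_key),
        store_key    INTEGER REFERENCES store_dim(store_key),
        product_key  INTEGER REFERENCES product_dim(product_key),
        quantity     INTEGER,
        dollar_sales REAL
    );

    INSERT INTO date_dim    VALUES (1, '2008-06'), (2, '2009-06');
    INSERT INTO store_dim   VALUES (1, 'Atlanta', 'South'), (2, 'Boston', 'North');
    INSERT INTO product_dim VALUES (1, 'Gizmo'), (2, 'Robot');
    INSERT INTO sales_fact  VALUES
        (1, 1, 1, 10, 45.0),
        (2, 1, 1, 14, 63.0),
        (2, 2, 1,  5, 22.5),
        (2, 1, 2,  3, 60.0);
""")

# Every join goes from the fact table straight to a dimension table.
gizmos_in_south = conn.execute("""
    SELECT d.sales_month, SUM(f.quantity)
    FROM sales_fact f
    JOIN date_dim    d ON d.date_key    = f.date_key
    JOIN store_dim   s ON s.store_key   = f.store_key
    JOIN product_dim p ON p.product_key = f.product_key
    WHERE p.product_name = 'Gizmo' AND s.region = 'South'
    GROUP BY d.sales_month
    ORDER BY d.sales_month
""").fetchall()
print(gizmos_in_south)
```

However many dimensions are added, the query shape stays the same: one join per dimension, never a join from one dimension to another.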
Let’s look at an example of this for ACME Toys and Gizmos to get a better idea of what we’re talking about with this concept of de-normalization. Every product in our stores is associated with a department. If we have a dimension for product information, one of the pieces of information about the product would be the department it is in. In a normalized database, we would consider creating a department table to store department descriptions with one row for each department, and would use a short key code to refer to the department record in the product table.
However, in our data warehouse, we would include that department information, description and all, right in the product dimension. This will result in the same information being duplicated for each product in the department. What that buys us is a simpler structure that is easier to query and more efficient to retrieve information from, which is key to data warehouse usability. The extra space we consume in repeating the information is more than paid for by the improvement in speed and ease of querying the information. That will result in greater acceptance of the data warehouse by the user community, who will find it more intuitive and easier to retrieve their data.
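The department example can be sketched as a small load step that flattens the normalized source into a product dimension. The department codes, product names, and the function build_product_dimension are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical normalized source: departments stored once, products hold a code.
departments = {"TOY": "Toys and Games", "GAD": "Gadgets"}
products = [
    {"product_id": 1, "name": "Gizmo", "dept_code": "GAD"},
    {"product_id": 2, "name": "Widget", "dept_code": "GAD"},
    {"product_id": 3, "name": "Robot", "dept_code": "TOY"},
]

def build_product_dimension(products, departments):
    """Copy the department description into every product row."""
    return [
        {
            "product_key": p["product_id"],
            "product_name": p["name"],
            "department_code": p["dept_code"],
            "department_description": departments[p["dept_code"]],
        }
        for p in products
    ]

product_dim = build_product_dimension(products, departments)

# Filtering by department now needs no lookup in a second table.
gadgets = [
    row["product_name"]
    for row in product_dim
    if row["department_description"] == "Gadgets"
]
print(gadgets)
```

The description "Gadgets" is stored twice here, once per product, which is exactly the duplication the de-normalized design accepts in exchange for simpler queries.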
In general, we will want to de-normalize our data warehouse implementation in nearly all cases, but there is the possibility that we might want to include another level, basically a dimension table referenced by another dimension table. In most cases, we will neither need nor want to do this, and instances should be kept to an absolute minimum, but there are some cases where it might make sense.
This is a variation of the star schema referred to as a snowflake schema because with this type of implementation, dimension tables are partially normalized to pull common data out into secondary dimension tables. The resulting schema diagram looks somewhat like a snowflake. The secondary dimension tables are the tips of the snowflake hanging off the main dimension tables in a star schema.
In reality, we’d want at the most only one or two of the secondary dimension tables; but it serves to illustrate the point. A snowflake dimension table is really not recommended in most cases because of ease-of-use and performance considerations, but can be used in very limited circumstances.
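For comparison, here is a minimal snowflake sketch in which the department is pulled out of the product dimension into a secondary dimension table. As before, the table and column names are assumptions for illustration only. Note the second-level join that a department question now requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Secondary dimension table: the tip of the snowflake.
    CREATE TABLE department_dim (department_key INTEGER PRIMARY KEY, description TEXT);

    -- Main dimension table, now partially normalized.
    CREATE TABLE product_dim (
        product_key    INTEGER PRIMARY KEY,
        product_name   TEXT,
        department_key INTEGER REFERENCES department_dim(department_key)
    );

    CREATE TABLE sales_fact (
        product_key  INTEGER REFERENCES product_dim(product_key),
        dollar_sales REAL
    );

    INSERT INTO department_dim VALUES (1, 'Gadgets'), (2, 'Toys');
    INSERT INTO product_dim    VALUES (1, 'Gizmo', 1), (2, 'Robot', 2);
    INSERT INTO sales_fact     VALUES (1, 45.0), (1, 63.0), (2, 60.0);
""")

# Fact -> product dimension -> department dimension: two levels of joins.
sales_by_department = conn.execute("""
    SELECT dep.description, SUM(f.dollar_sales)
    FROM sales_fact f
    JOIN product_dim    p   ON p.product_key      = f.product_key
    JOIN department_dim dep ON dep.department_key = p.department_key
    GROUP BY dep.description
    ORDER BY dep.description
""").fetchall()
print(sales_by_department)
```

In the plain star schema the department description would sit in product_dim itself and the last join would disappear.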
Let’s now talk a little bit about the multidimensional implementation of a dimensional model in the database, and then we’ll design our cube and dimensions specifically for the ACME Toys and Gizmos Company data warehouse.