
Any good data mining project is built on a robust data mining architecture. Without it, your project might well be time-consuming, overly complicated or simply inaccurate. Whether you’re new to data mining or want to re-familiarize yourself with what the structure of a data mining architecture should look like, you’ve come to the right place. Of course, this is just a guide to what a data mining architecture should look like. You’ll need to be aware of how this translates to your needs and situation.

This article has been taken from the book Data Mining with R.

The core components of a data mining architecture

Let’s first gain a general view of the main components of a data mining architecture. It comprises all the basic elements you will need to perform the activities described in the previous chapter. As a minimum set of components, the following are usually considered:

  • Data sources
  • Data warehouse
  • Data mining engine
  • User interface

Below is a diagram of a data mining architecture. You can see how each of the elements fit together:

Data mining architecture diagram

Before we get into the details of each of the components of a data mining architecture, let’s first briefly look at how these components fit together:

  • Data sources: These are all the possible sources of the bits of information to be analyzed. Data sources feed our data warehouse, and are in turn fed by the data produced by the user’s activity through the user interface.
  • Data warehouse: This is where the data is stored when acquired from data sources.
  • Data mining engine: This contains all of the logic and the processes needed to perform the actual data mining activity, taking data from the data warehouse.
  • User interface: The front office of our machine, which allows the user to interact with the data mining engine, creating data that will be stored within the data warehouse and that could become part of the big ocean of data sources.

We’ll now delve a little deeper into each of these elements, starting with data sources.

How data sources fit inside the data mining architecture

Data sources are everywhere. This is becoming more and more true every day thanks to the internet of things. Now that every kind of object can be connected to the internet, we can collect data from a huge range of new physical sources. This data can come in a form that is already suitable for collection and storage within our databases, or in a form that needs to be modified further to become usable for our analyses.

We can, therefore, see that between our data sources and the physical data warehouse where they are going to be stored, a small component lies: the set of tools and software needed to make data coming from sources storable.

We should note something here: we are not talking about data cleaning and data validation. Those activities will be performed later on by our data mining engine, which retrieves data from the data warehouse.

Types of data sources

There are a range of data sources. Each type will require different data modelling techniques. Getting this wrong could seriously hamper your data mining projects, so an awareness of how data sources differ is actually really important.

Unstructured data sources

Unstructured data sources are data sources missing a logical data model. Whenever you find a data source where no particular logic and structure is defined to collect, store, and expose it, you are dealing with an unstructured data source. The most obvious example of an unstructured data source is a written document. That document has a lot of information in it, but there’s no structure that defines and codifies how information is stored.

Some data modeling techniques can be useful here; some can even derive structured data from unstructured data. This kind of analysis is becoming increasingly popular as companies seek to use ‘social listening’ to understand sentiment on social media.
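
As a toy illustration of the idea, and not of any particular technique, here is a minimal base R sketch that derives a structured word-frequency table from an unstructured string (the text and names are invented):

```r
# Unstructured input: plain text with no data model behind it
text <- "data mining needs data, and data needs structure"

# Strip punctuation, normalize case, and split on whitespace
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))

# Structured output: a table of word frequencies
as.data.frame(table(word = words))
```

Real social listening pipelines are, of course, far more sophisticated, but the principle is the same: imposing a structure on data that had none.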

Structured data sources

Structured data sources are highly organized. These kinds of data sources follow a specific data model, and the engine that handles the storing activity is programmed to respect this model.

A well-known data model behind structured data is the so-called relational model of data. Following this model, each table has to represent an entity within the considered universe of analysis. Each entity will then have a specific attribute within each column, and a related observation within each row. Finally, each entity can be related to the others through key attributes.

We can think of the example of a relational database for a small factory. Within this database, we have a table recording all customer orders and one table recording all shipments. Finally, a table recording the warehouse’s movements will be included.

Within this database, we will have:

  • The warehouse table linked to the shipment table through the product_code attribute
  • The shipment table linked to the customer table through the shipment_code attribute

It can easily be seen that a relevant advantage of this model is the possibility of easily performing queries within tables, and merges between them. The cost of analyzing structured data is far lower than that of dealing with unstructured data.
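
To get a feel for how key attributes make merges easy, here is a minimal sketch in base R using toy versions of the factory’s tables (the columns and values are invented for illustration):

```r
# Hypothetical miniature versions of the three factory tables
warehouse <- data.frame(product_code = c("P1", "P2"),
                        stock        = c(100, 40))
shipments <- data.frame(shipment_code = c("S1", "S2"),
                        product_code  = c("P1", "P2"),
                        quantity      = c(10, 5))
customers <- data.frame(customer_id   = c("C1", "C2"),
                        shipment_code = c("S1", "S2"))

# Join warehouse to shipments through product_code,
# then shipments to customers through shipment_code
merged <- merge(merge(warehouse, shipments, by = "product_code"),
                customers, by = "shipment_code")
merged
```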

Key issues of data sources

When dealing with data sources and planning their acquisition into your data warehouse, some specific aspects need to be considered:

  • Frequency of feeding: Is the data updated with a frequency feasible for the scope of your data mining activity?
  • Volume of data: Can the volume of data be handled by your system, or is it too much? This is often the case for unstructured data, which tends to occupy more space for a given piece of information.
  • Data format: Is the data format readable by your data warehouse solution, and subsequently, by your data mining engine?

A careful evaluation of these three aspects has to be performed before implementing the data acquisition phase, to avoid serious problems during the project.

How databases and data warehouses fit in the data mining architecture

What is a data warehouse, and how is it different from a simple database?

A data warehouse is a software solution aimed at storing usually large amounts of data, properly related to one another and indexed through a time-related index. We can better understand this by looking at the data warehouse’s cousin: the operational database.

These kinds of instruments are usually small, and aimed at storing and querying data, overwriting old data when new data is available. Data warehouses are therefore usually fed by databases, and store data from those kinds of sources, ensuring historical depth and read-only access from other users and software applications. Moreover, data warehouses are usually employed at a company level, to store, and make available, data from (and to) all company processes, while databases are usually related to one specific process or task.

How do you use a data warehouse for your data mining project?

You’re probably not going to use a data warehouse directly for your data mining process. More specifically, data will be made available via a data mart. A data mart is a partition or sub-element of a data warehouse: a set of data fed directly from the data warehouse, and related to a specific company area or process. A real-life example is a data mart created to store data related to default events for the purpose of modeling customers’ probability of default.

This kind of data mart will collect data from different tables within the data warehouse, properly joining them into new tables that will not communicate with the data warehouse’s own tables. We can therefore consider the data mart as an extension of the data warehouse.

Data warehouses are usually classified into three main categories:

  • One-level architecture, where only a simple database is available and the data warehousing activity is performed by means of a virtual component
  • Two-level architecture, composed of a group of operational databases related to different activities, plus a proper data warehouse
  • Three-level architecture, with one or more operational databases, a reconciled database and a proper data warehouse

Let’s now have a closer look at those three different types of data warehouse.

One-level database

This is by far the simplest and, in a way, most primitive model. Within one-level data warehouses, we actually have just one operational database, where data is both written and read, mixing those two kinds of activities. A virtual data warehouse layer is then offered to perform inquiry activities. This is a primitive model for the simple reason that it is not able to guarantee the appropriate level of segregation between live data, which is the data currently produced by the process, and historical data. This model could therefore produce inaccurate data and even lead to data loss.

This model would be particularly dangerous for data mining activity, since it would not ensure a clear segregation between the development environment and the production one.

Two-level database

This more sophisticated model encompasses a first level of operational databases, for instance, those employed within marketing, production, and accounting processes, and a proper data warehouse environment. Within this solution, the databases are to be considered as feeding data sources, where the data is produced, possibly validated, and then made available to the data warehouse.

The data warehouse will then store and freeze data coming from databases, for instance, with a daily frequency.

Every set of data stored within a day will be labeled with a proper attribute showing the date of record. This will later allow us to retrieve records related to a specific time period in a sort of time machine functionality. Going back to our previous probability of default example, this kind of functionality will allow us to retrieve all default events that have occurred within a given time period, constituting the estimation sample for our model.
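
As a sketch of this time machine functionality, assuming a hypothetical defaults table labeled with a record_date attribute, retrieving a given period in R could look like this:

```r
# Hypothetical table of default events as frozen in the warehouse,
# each record labeled with the date on which it was stored
defaults <- data.frame(
  customer_id = c("C1", "C2", "C3"),
  record_date = as.Date(c("2017-01-10", "2017-06-22", "2018-02-03"))
)

# "Time machine": retrieve only the records stored within a given period,
# constituting the estimation sample for the model
estimation_sample <- subset(defaults,
                            record_date >= as.Date("2017-01-01") &
                            record_date <= as.Date("2017-12-31"))
```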

Two-level architecture is an optimal solution for data mining processes, since it allows us to provide a safe environment, the previously mentioned data mart, in which to develop data mining activity without compromising the quality of data residing within the rest of the data warehouse and within the operational databases.

Three-level database

Three-level databases are the most advanced ones. The main difference between them and the two-level ones is the presence of a reconciliation stage, which is performed through Extraction, Transformation, and Load (ETL) instruments. To understand the relevance of these kinds of instruments, we can resort to a practical example once again: the one we took advantage of a few lines previously, the probability of default model.

Imagine we are estimating this kind of model for customers clustered as large corporates, for which public forecasts, outlooks and ratings are made available by financial analysis companies like Moody’s, Standard & Poor’s, and similar.

Since this data could reasonably be related to the probability of default of our customers, we would probably be interested in adding it to our estimation database. This can easily be done by means of ETL instruments. These instruments will ensure, within the reconciliation stage, that data gathered from internal sources, such as personal data and default events data, is properly matched with the external information we have mentioned.

Moreover, even within internal data fields only, those instruments will ensure the needed level of quality and coherence among different sources, at least within the data warehouse environment.
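
The matching idea behind the reconciliation stage can be sketched in a few lines of base R (the tables, keys, and values here are invented; real reconciliation is done by dedicated ETL instruments):

```r
# Internal default events (hypothetical)
internal <- data.frame(customer_id = c("C1", "C2"),
                       defaulted   = c(TRUE, FALSE))

# External ratings from an analyst firm (hypothetical)
external <- data.frame(customer_id = c("C1", "C2"),
                       rating      = c("BB", "A"))

# The reconciliation step matches internal and external records
# on a shared key before loading them into the estimation table
reconciled <- merge(internal, external, by = "customer_id")
```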

Data warehouse technologies

We are now going to look a bit more closely at the actual technology, most of which is open source. A proper awareness of their existence and main features should be enough, since you will usually be taking input data from them through an interface provided by your programming language.

Nevertheless, knowing what’s under the hood is pretty useful…

SQL

SQL stands for Structured Query Language, and identifies what has for many years been the standard within the field of data storage. The basis for this programming language, employed for storing and querying data, is the so-called relational database. The theory behind these databases was first introduced by IBM engineer Edgar F. Codd, and is based on the following main elements:

  • Tables, each of which represent an entity
  • Columns, each of which represent an attribute of the entity
  • Rows, each one representing a record of the entity
  • Key attributes, which permit us to relate two or more tables together, establishing relations between them

Starting from these main elements, the SQL language provides a concise and effective way to query and retrieve this data. Moreover, basic data munging operations, such as table merging and filtering, are possible through the SQL language.

As previously mentioned, SQL and relational databases have formed the vast majority of data warehouse systems around the world for many, many years. A really famous example of an SQL-based data storage product is the well-known Microsoft Access. In this software, behind the familiar user interface, SQL code hides to store, update, and retrieve the user’s data.
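
To see the flavor of SQL querying from within R, here is a minimal sketch using the DBI and RSQLite packages (assumed to be installed), with an invented orders table held in an in-memory database:

```r
library(DBI)

# Open an in-memory SQLite database and create a toy table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "orders",
             data.frame(order_id = 1:3, amount = c(10, 25, 7)))

# A basic SQL query: filtering and retrieving rows
dbGetQuery(con, "SELECT order_id, amount FROM orders WHERE amount > 8")

dbDisconnect(con)
```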

MongoDB

While SQL-based products are still very popular, NoSQL technology has been around for a long time now, showing its relevance and effectiveness. Behind this acronym stand all data storing and managing solutions not based on the relational paradigm and its main elements. Among these is the document-oriented paradigm, where data is represented as documents, which are complex virtual objects identified by some kind of code, and without a fixed scheme.

A popular product developed following this paradigm is MongoDB. This product stores data, representing it in the JSON format. Data is therefore organized into documents and collections, that is, sets of documents. A basic example of a document is the following:

{
  name: "donald",
  surname: "duck",
  style: "sailor",
  friends: ["mickey mouse", "goofy", "daisy"]
}

As you can see, even from this basic example, the MongoDB paradigm will allow you to easily store data even with a rich and complex structure.
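
If you want to talk to MongoDB from R, one option is the mongolite package. The following sketch assumes that package is installed and a MongoDB instance is running locally, and uses invented collection and field names:

```r
library(mongolite)

# Connect to a collection on a local MongoDB instance
m <- mongo(collection = "characters", db = "test")

# Store a document and then query it back by field value
m$insert('{"name": "donald", "surname": "duck", "style": "sailor"}')
m$find('{"name": "donald"}')
```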

Hadoop

Hadoop is a leading technology within the field of data warehouse systems, mainly due to its ability to effectively handle large amounts of data. To maintain this ability, Hadoop fully exploits the concept of parallel computing, by means of a central master that divides all the data needed for a job into smaller chunks to be sent to two or more slaves. Those slaves are to be considered as nodes within a network, each of them working separately and locally. They can actually be physically separate pieces of hardware, or even cores within a CPU (which is usually considered pseudo-parallel mode).

At the heart of Hadoop is the MapReduce programming model. This model, originally conceptualized by Google, consists of a processing layer, and is responsible for moving the data mining activity close to where data resides. This minimizes the time and cost needed to perform computation, allowing for the possibility to scale the process to hundreds and hundreds of different nodes.
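
The map and reduce steps can be imitated, purely as a toy illustration, in base R: each chunk is counted independently (as a Hadoop node would do), and the partial results are then combined:

```r
# Data already split into chunks, as the master would distribute it
chunks <- list(c("data", "mining"), c("data", "warehouse"), c("data"))

# Map: each chunk is counted independently (on a separate node, in Hadoop)
partial <- lapply(chunks, function(chunk) table(chunk))

# Reduce: partial counts are combined into a single result
word_counts <- Reduce(function(a, b) {
  all_words <- union(names(a), names(b))
  sapply(all_words, function(w) sum(a[w], b[w], na.rm = TRUE))
}, partial)
```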

Read next: Why choose R for your data mining project [link]

The data mining engine that drives a data mining architecture

The data mining engine is the true heart of our data mining architecture. It consists of tools and software employed to gain insights and knowledge from data acquired from data sources, and stored within data warehouses.

What makes a data mining engine?

As you should be able to imagine at this point, a good data mining engine is composed of at least three components:

  • An interpreter, able to transmit commands defined within the data mining engine to the computer
  • Some kind of gear between the engine and the data warehouse to produce and handle communication in both directions
  • A set of instructions, or algorithms, needed to perform data mining activities

Let’s take a look at these components in a little more detail.

The interpreter

The interpreter carries out instructions coming from a higher-level programming language, translating them into instructions understandable by the piece of hardware it is running on, and transmits them to it. Obtaining the interpreter for the language you are going to perform data mining with is usually as simple as obtaining the language itself. In the case of our beloved R language, installing the language will automatically install the interpreter as well.

The interface between the engine and the data warehouse

While the interpreter was introduced previously, the interface we are talking about in this section is a new character within our story. It is the software that enables your programming language to talk with the data warehouse solution you have been provided with for your data mining project.

To exemplify the concept, let’s consider a setup adopting as its data mining engine a bunch of R scripts, with their related interpreter, while employing an SQL-based database to store data. In this case, what would be the interface between the engine and the data warehouse?

It could be, for instance, the RODBC package, which is a well-established package designed to let R users connect to remote servers, and transfer data from those servers to their R session. By employing this package, it will also be possible to write data to your data warehouse.

This package works exactly like a gear between the R environment and the SQL database. This means you will write your R code, which will then be translated into a language readable by the database and sent to it.

For sure, this translation also works the other way, meaning that results coming from your instructions, such as new tables of results from a query, will be formatted in a way that’s readable by the R environment and conveniently shown to the user.
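
A typical RODBC session might look like the following sketch; the data source name and table names are invented for illustration, and an ODBC connection is assumed to have already been configured on the system:

```r
library(RODBC)

# Open a channel to a hypothetical ODBC data source
channel <- odbcConnect("warehouse_dsn")

# Read a table from the data warehouse into the R session
orders <- sqlQuery(channel, "SELECT * FROM orders")

# Write results back to the warehouse
sqlSave(channel, orders, tablename = "orders_copy")

odbcClose(channel)
```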

The data mining algorithms

This last element of the engine is the actual core topic of the book you are reading: the data mining algorithms. To help you gain an organic and systematic view of what we have learned so far, we can consider that these algorithms will be the result of the data modelling phase described in the previous chapter, in the context of the CRISP-DM methodology. This will usually not include the code needed to perform basic data validation treatments, such as integrity checking and massive merging among data from different sources, since those kinds of activities will be performed within the data warehouse environment. This is especially true in the case of three-level data warehouses, which have a dedicated reconciliation layer.

The user interface – the bit that makes the data mining architecture accessible

Until now, we have been looking at the back office of our data mining architecture, which is the part not directly visible to its end user. Imagine this architecture is provided to be employed by someone not skilled enough to work on the data mining engine itself; we will need some way to let this user interact with the architecture in the right way, and discover the results of their interaction. This is what a user interface is all about.

Clarity and simplicity

There’s a lot to be said about UI design that sits more in the field of design than data analysis. Clearly, those fields are getting blurred as data mining becomes more popular, and as ‘self-service’ analytics grows as a trend.

However, the fundamental elements of a UI are clarity and simplicity. What this means is that it is designed with purpose and usage in mind. What do you want to see? What do you want to be able to do with your data?

Ask yourself this question: how many steps do you need to perform to reach the objective you want to reach with the product? Imagine evaluating a data mining tool, and particularly its data import feature. Evaluating the efficiency of the tool in this regard would involve answering the following question: how many steps do I need to perform to import a dataset into my data mining environment?

Every piece is important in the data mining architecture

When it comes to data mining architecture, it’s essential that you don’t overlook any part of it. Every component is essential. Of course, like any other data mining project, understanding what your needs are, and the needs of those in your organization, is going to inform how you build each part. But fundamentally the principles behind a robust and reliable data mining architecture will always remain the same.

Read more: Expanding Your Data Mining Toolbox [link]

