Today we explore popular data analytics methods such as Microsoft TDSP Process and the CRISP- DM methodology.
There is an increasing amount of data in the world, and in our databases. The data deluge is not going to go away anytime soon! Businesses risk wasting the useful business value of information contained in databases, unless they are able to excise useful knowledge from the data. It can be hard to know how to get started. Fortunately, there are a number of frameworks in data science that help us to work our way through an analytics project. Processes such as Microsoft Team Data Science Process (TDSP) and CRISP-DM position analytics as a repeatable process that is part of a bigger vision.
Why are they important? The Microsoft TDSP Process and the CRISP-DM frameworks are frameworks for analytics projects that lead to standardized delivery for organizations, both large and small. In this chapter, we will look at these frameworks in more detail, and see how they can inform our own analytics projects and drive collaboration between teams. How can we have the analysis shaped so that it follows a pattern so that data cleansing is included?
Industry standard methodologies for analytics
There are a few main methodologies: the Microsoft TDSP Process and the CRISP-DM Methodology. Ultimately, they are all setting out to achieve the same objectives as an analytics framework. There are differences, of course, and these are highlighted here.
CRISP-DM and TDSP focus on the business value and the results derived from analytics projects. Both of these methodologies are described in the following sections.
One common methodology is the CRISP-DM methodology (the modeling agency). The Cross Industry Standard Process for Data Mining or (CRISP-DM) model as it is known, is a process model that provides a fluid framework for devising, creating, building, testing, and deploying machine learning solutions. The process is loosely divided into six main phases. The phases can be seen in the following diagram:
Initially, the process starts with a business idea and a general consideration of the data. Each stage is briefly discussed in the following sections.
Business understanding/data understanding
The first phase looks at the machine learning solution from a business standpoint, rather than a technical standpoint. The business idea is defined, and a draft project plan is generated. Once the business idea is defined, the data understanding phase focuses on data collection and familiarity. At this point, missing data may be identified, or initial insights may be revealed. This process feeds back to the business understanding phase.
CRISP-DM model — data preparation
In this stage, data will be cleansed and transformed, and it will be shaped ready for the modeling phase.
CRISP-DM — modeling phase
In the modeling phase, various techniques are applied to the data. The models are further tweaked and refined, and this may involve going back to the data preparation phase in order to correct any unexpected issues.
CRISP-DM — evaluation
The models need to be tested and verified to ensure that they meet the business objectives that were defined initially in the business understanding phase. Otherwise, we may have built models that do not answer the business question.
CRISP-DM — deployment
The models are published so that the customer can make use of them. This is not the end of the story, however.
CRISP-DM — process restarted
We live in a world of ever-changing data, business requirements, customer needs, and environments, and the process will be repeated.
CRISP-DM is the most commonly used framework for implementing machine learning projects specifically, and it applies to analytics projects as well. It has a good focus on the business understanding piece. However, one major drawback is that the model no longer seems to be actively maintained. The official site, CRISP-DM.org, is no longer being maintained. Furthermore, the framework itself has not been updated on issues on working with new technologies, such as big data. Big data technologies means that there can be additional effort spend in the data understanding phase, for example, as the business grapples with the additional complexities that are involved in the shape of big data sources.
Team Data Science Process
The TDSP process model provides a dynamic framework to machine learning solutions that have been through a robust process of planning, producing, constructing, testing, and deploying models. Here is an example of the TDSP process:
The process is loosely divided into four main phases:
- Business Understanding
- Data Acquisition and Understanding
The phases are described in the following paragraphs.
The Business understanding process starts with a business idea, which is solved with a machine learning solution. The business idea is defined from the business perspective, and possible scenarios are identified and evaluated. Ultimately, a project plan is generated for delivering the solution.
Data acquisition and understanding
Following on from the business understanding phase is the data acquisition and understanding phase, which concentrates on familiarity and fact-finding about the data. The process itself is not completely linear; the output of the data acquisition and understanding phase can feed back to the business understanding phase, for example. At this point, some of the essential technical pieces start to appear, such as connecting to data, and the integration of multiple data sources. From the user’s perspective, there may be actions arising from this effort. For example, it may be noted that there is missing data from the dataset, which requires further investigation before the project proceeds further.
In the modeling phase of the TDSP process, the R model is created, built, and verified against the original business question. In light of the business question, the model needs to make sense. It should also add business value, for example, by performing better than the existing solution that was in place prior to the new R model. This stage also involves examining key metrics in evaluating our R models, which need to be tested to ensure that the models meet the original business objectives set out in the initial business understanding phase.
R models are published to production, once they are proven to be a fit solution to the original business question. This phase involves the creation of a strategy for ongoing review of the R model’s performance as well as a monitoring and maintenance plan. It is recommended to carry out a recurrent evaluation of the deployed models. The models will live in a fluid, dynamic world of data and, over time, this environment will impact their efficacy.
The TDSP process is a cycle rather than a linear process, and it does not finish, even if the model is deployed. It is comprised of a clear structure for you to follow throughout the Data Science process, and it facilitates teamwork and collaboration along the way.
The data science unicorn does not exist; that is, the person who is equally skilled in all areas of data science, right across the board. In order to ensure successful projects where each team player contributes according to their skill set, the Team Data Science Summary is a team-oriented solution that emphasizes teamwork and collaboration throughout. It recognizes the importance of working as part of a team to deliver Data Science projects. It also offers useful information on the importance of having standardized source control and backups, which can include open source technology.
If you liked our post, please be sure to check out Advanced Analytics with R and Tableau that shows how to apply various data analytics techniques in R and Tableau across the different stages of a data science project highlighted in this article.