Explaining Data Exploration in under a minute

[box type="note" align="" class="" width=""]Below given article is taken from the book Machine Learning with R written by Brett Lantz. This book will help you harness the power of R for statistical computing and data science.[/box]

Today, we shall explore different data exploration techniques and a real world example of using these techniques.

Introduction

Data Exploration is a term used for finding insightful information from data. To find insights from data various steps such as data munging, data analysis, data modeling, and model evaluation are taken. In any real data exploration project, commonly six steps are involved in the exploration process. They are as follows:

Asking the right questions: Asking the right questions will help in understanding the objective and target information sought from the data. Questions can be asked such as What are my expected findings after the exploration is finished?, or What kind of information can I extract through the exploration?
Data collection: Once the right questions have been asked the target of exploration is cleared. Data collected from various sources is in unorganized and diverse format. Data may come from various sources such as files, databases, internet, and so on. Data collected in this way is raw data and needs to be processed to extract meaningful information. Most of the analysis and visualizing tools or applications expect data to be in a certain format to generate results and hence the raw data is of no use for them.
Data munging: Raw data collected needs to be converted into the desired format of the tools to be used. In this phase, raw data is passed through various processes such as parsing the data, sorting, merging, filtering, dealing with missing values, and so on. The main aim is to transform raw data in the format that the analyzing and visualizing tools understand. Once the data is compatible with the tools, analysis and visualizing tools are used to generate the different results.
Basic exploratory data analysis: Once the data munging is done and data is formating for the tools, it can be used to perform data exploration and analysis. Tools provide various methods and techniques to do the same. Most analyzing tools allow statistical functions to be performed on the data. Visualizing tools help in visualizing the data in different ways. Using basic statistical operations and visualizing the same data can be understood in better way.
Advanced exploratory data analysis: Once the basic analysis is done it's time to look at an advanced stage of analysis. In this stage, various prediction models are formed on basis of requirement. Machine learning algorithms are utilized to train the model and generate the inferences. Various tuning on the model is also done to ensure correctness and effectiveness of the model.
Model assessment: When the models are mare, they are evaluated to find the best model from the given different models. The major factor to decide the best model is to see how perfect or closely it can predict the values. Models are tuned here also for increasing the accuracy and effectiveness. Various plots and graphs are used to see the model’s prediction.

Real world example - using Air Quality Dataset

Air quality datasets come bundled with R. They contain data about the New York Air Quality Measurements of 1973 for five months from May to September recorded daily. To view all the available datasets use the data() function, it will display all the datasets available with R installation.

How to do it

Perform the following step to see all the datasets in R and using airquality:

> data()

> str(airquality)

Output

'data.frame': 153 obs. of 6 variables:

$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...

$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...

$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...

$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...

$ Month : int 5 5 5 5 5 5 5 5 5 5 ...

$ Day : int 1 2 3 4 5 6 7 8 9 10 ...

> head(airquality)

Output

Ozone Solar.R Wind Temp Month Day

1 41 190 7.4 67 5 1

2 36 118 8.0 72 5 2

3 12 149 12.6 74 5 3

4 18 313 11.5 62 5 4

5 NA NA 14.3 56 5 5

6 28 NA 14.9 66 5 6

How it works

The str command is used to display the structure of the dataset, as you can see it contains the information about the observation of ozone, solar, wind, and temp attributes recorded each day for five months. Using the head function, you can see the first few lines of actual data. The dataset is very basic and is enough to start processing and analyzing data at a very basic level. Kaggle website, which has various diverse kinds of datasets. Apart from datasets it also

holds many competitions in data science fields to solve real-world problems. You can find the competitions, datasets, kernels, and jobs at https://www. kaggle.com/. Many competitions are organized by large corporate bodies, government agencies, or from academia. Many of the competitions have prize money associated with them. The following screenshot shows competitions and prize money.

explaining-data-exploration-in-under-a-minute-img-0

You can simply create an account and start participating in competitions by submitting code and the output and the same will be assessed. Assessment or evaluation criteria is available on the detail page of each competition. By participating and using https:/ / www.kaggle. com/ one gains experience in solving real-world problems. It gives you a taste of what data scientist do. On the jobs page various jobs for data scientists and analysis is listed and you can apply if the profile is suitable or matches with your interests.

If you liked our post, be sure to check out Machine Learning with R which consists of more useful machine learning techniques with R.