
In this article, Sampath Kumar Kanthala, the author of the book Practical Data Analysis, discusses how to obtain, clean, normalize, and transform raw data into a standard format such as CSV or JSON using OpenRefine.

In this article, we will cover:

  • Data scrubbing
  • Statistical methods
  • Text parsing
  • Data transformation


Data scrubbing

Data scrubbing, also called data cleansing, is the process of correcting or removing data in a dataset that is incorrect, inaccurate, incomplete, improperly formatted, or duplicated.

The result of the data analysis process depends not only on the algorithms, but also on the quality of the data. That is why the next step after obtaining the data is data scrubbing. To avoid dirty data, our dataset should possess the following characteristics:

  • Correctness
  • Completeness
  • Accuracy
  • Consistency
  • Uniformity

Dirty data can be detected by applying simple statistical validation, by parsing text, or by deleting duplicate values. Missing or sparse data can lead you to highly misleading results.

Statistical methods

In this method, we need some context about the problem (knowledge domain) to find values that are unexpected and thus erroneous. Even when the data type matches, values may fall outside the valid range; this can be resolved by setting them to an average or mean value. Statistical validation can also be used to handle missing values, which can be replaced by one or more probable values using interpolation, or by reducing the dataset using decimation (see the sketch after the following list).

  • Mean: Value calculated by summing up all values and then dividing by the number of values.
  • Median: The value below which 50% of the values in a range fall, and above which the remaining 50% fall.
  • Range constraints: Numbers or dates should fall within a certain range. That is, they have minimum and/or maximum possible values.
  • Clustering: Usually, when we obtain data directly from users, some values include ambiguity or refer to the same value with a typo. For example, “Buchanan Deluxe 750ml 12×01” and “Buchanan Deluxe 750ml 12×01.” differ only by a “.”, and “Microsoft” or “MS” instead of “Microsoft Corporation” all refer to the same company. All these values are valid; in such cases, grouping can help us obtain accurate data and eliminate duplicates, enabling faster identification of unique values.
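
As a minimal sketch of how missing and out-of-range values could be handled, here is an example using the pandas library (the price column, sample values, and range limits are hypothetical, not from the original dataset):

import pandas as pd

# Hypothetical sample data with a missing value and an out-of-range entry
df = pd.DataFrame({'price': [12.5, 13.0, None, 12.8, 950.0]})

# Range constraint: mark values outside a plausible range as missing
df.loc[(df['price'] < 0) | (df['price'] > 100), 'price'] = None

# Replace missing values with the mean...
mean_filled = df['price'].fillna(df['price'].mean())

# ...or estimate them by linear interpolation
interpolated = df['price'].interpolate()

print(mean_filled)
print(interpolated)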

Text parsing

We perform parsing to validate whether a string of data is well formatted and to avoid syntax errors.

Text fields such as dates, e-mail addresses, phone numbers, and IP addresses usually have to be validated with regular expression patterns (Regex is a common abbreviation for “regular expression”).

In Python, we will use the re module to implement regular expressions and perform text searches and pattern validations.

First, we need to import the re module.

import re

In the following examples, we will implement three of the most common validations (e-mail, IP address, and date format).

  • E-mail validation:
myString = 'From: readers@packt.com (readers email)'

result = re.search(r'([\w.-]+)@([\w.-]+)', myString)

if result:
    print(result.group(0))
    print(result.group(1))
    print(result.group(2))

Output:

>>> readers@packt.com

>>> readers

>>> packt.com

The function search() scans through a string, looking for any location where the Regex matches. The function group() returns the string matched by the Regex. The pattern \w matches any alphanumeric character and is equivalent to the class [a-zA-Z0-9_].

  • IP address validation:
    isIP = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
    
    myString = " Your IP is: 192.168.1.254 "
    
    result = re.findall(isIP,myString)
    
    print(result)

Output:

>>> ['192.168.1.254']

The function findall() finds all the substrings where the Regex matches and returns them as a list. The pattern \d matches any decimal digit and is equivalent to the class [0-9].

  • Date format:
    myString = "01/04/2001"
    
    isDate = re.match('[0-1][0-9]/[0-3][0-9]/[1-2][0-9]{3}', myString)
    
    if isDate:
    
       print("valid")
    
    else:
    
       print("invalid")

Output:

>>> valid

The function match() checks whether the Regex matches at the beginning of the string. The pattern uses the class [0-9] to parse the date format.

For more information about regular expressions:

http://docs.python.org/3.4/howto/regex.html#regex-howto

Data transformation

Data transformation is usually related to databases and data warehouses, where values are extracted from a source format, transformed, and loaded into a destination format.

Extract, Transform, and Load (ETL) obtains data from data sources, performs some transformation function depending on our data model, and loads the resulting data into the destination (see the sketch after the following list).

  • Data extraction allows us to obtain data from multiple data sources, such as relational databases, data streaming, text files (JSON, CSV, XML), and NoSQL databases.
  • Data transformation allows us to cleanse, convert, aggregate, merge, replace, validate, format, and split data.
  • Data loading allows us to load data into a destination format, such as relational databases, text files (JSON, CSV, XML), and NoSQL databases.
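
As a minimal sketch of these three steps in Python, assuming a hypothetical products.csv file with name and price columns:

import csv
import json

# Extract: read rows from a CSV source (hypothetical file name)
with open('products.csv', newline='') as source:
    rows = list(csv.DictReader(source))

# Transform: cleanse and convert values
for row in rows:
    row['name'] = row['name'].strip()    # remove stray whitespace
    row['price'] = float(row['price'])   # cast text to a number

# Load: write the result into a JSON destination
with open('products.json', 'w') as destination:
    json.dump(rows, destination, indent=2)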

In statistics, data transformation refers to the application of a mathematical function to each point in a dataset or time series.
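
For example, a logarithmic transformation is often applied to compress a strongly skewed series; here is a quick sketch (the sample values are made up):

import math

# Hypothetical skewed series
values = [1, 10, 100, 1000, 10000]

# Apply a log10 transformation to each point
transformed = [math.log10(v) for v in values]

print(transformed)   # roughly [0.0, 1.0, 2.0, 3.0, 4.0]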

Summary

In this article, we introduced the basic concepts of data scrubbing, such as statistical methods and text parsing, and explored how data transformation moves data from a source format to a destination format.
