4 misconceptions about data wrangling

October 17, 2018 - 2:00 pm


Around 80% of the time in data analysis is spent on cleaning and preparing data. That work is nonetheless important: it is a prerequisite for the rest of the data analysis workflow, including visualization, analysis, and reporting. Despite its importance, there are certain myths associated with data wrangling that developers should be wary of.

In this post, we will discuss four such misconceptions.

Myth #1: Data wrangling is all about writing SQL queries

There was a time when data had to be presented in a relational form so that SQL queries could be written against it. Today, many other types of data sources can be analyzed in addition to classic static SQL databases. Often, an engineer has to pull data from diverse sources such as web portals, Twitter feeds, sensor fusion streams, and police or hospital records. Static SQL queries can help only so much in those diverse domains.

A programmatic approach that is flexible enough to interface with myriad sources, and that can parse the raw data through clever algorithmic techniques and the use of fundamental data structures (trees, graphs, hash tables, heaps), will be the winner.
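As a rough illustration of what such a programmatic approach can look like, here is a minimal sketch that reads records from two differently shaped sources (CSV text and nested JSON) and indexes them in a dictionary, which is Python's hash table. The sample data, the field names, and the helper functions `records_from_csv`, `records_from_json`, and `merge_sources` are all hypothetical and exist only for this example.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for two different data sources.
CSV_TEXT = "id,city,temp_c\n1,Paris,21.5\n2,Lyon,19.0\n"
JSON_TEXT = '[{"id": 3, "location": {"city": "Nice"}, "temp_c": 24.1}]'


def records_from_csv(text):
    """Yield normalized dicts from CSV text with a header row."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "id": int(row["id"]),
            "city": row["city"],
            "temp_c": float(row["temp_c"]),
        }


def records_from_json(text):
    """Yield normalized dicts from a JSON array with a nested city field."""
    for item in json.loads(text):
        yield {
            "id": int(item["id"]),
            "city": item["location"]["city"],
            "temp_c": float(item["temp_c"]),
        }


def merge_sources(*sources):
    """Index all records by id in a dict; later sources win on duplicate ids."""
    by_id = {}
    for source in sources:
        for record in source:
            by_id[record["id"]] = record
    return by_id


merged = merge_sources(records_from_csv(CSV_TEXT), records_from_json(JSON_TEXT))
cities = sorted(record["city"] for record in merged.values())
```

Each source gets its own small parser that emits the same record shape, so adding a new source (an API response or a log file, say) only means writing one more generator.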

Myth #2: Knowledge of statistics is not required for data wrangling

Quick statistical tests and visualizations are invaluable for checking the ‘quality’ of the data you sourced. These tests can help detect outliers and data entry errors without running complex scripts. Effective data wrangling does not require knowledge of advanced statistics. However, you must understand basic descriptive statistics and know how to compute them using built-in Python libraries.
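To make this concrete, the following sketch uses Python's built-in `statistics` module to summarize a small set of readings and flag suspicious values. The sample readings and the helper `find_outliers` are assumptions made for this example, and the 1.5 × IQR rule (Tukey's fences) is only one common convention for spotting outliers, not the only valid one.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return values outside Tukey's fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings; 120.0 looks like a data entry error.
readings = [12.1, 11.8, 12.4, 12.0, 11.9, 120.0, 12.2, 12.3]

summary = {
    "mean": statistics.mean(readings),
    "median": statistics.median(readings),
    "stdev": statistics.stdev(readings),
}
outliers = find_outliers(readings)
```

Here the mean is pulled far above the median by a single entry, which is already a hint that something is off, and `find_outliers` singles out the value 120.0 for a closer look.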

Myth #3: You have to be a machine learning expert to do great data wrangling

Deep knowledge of machine learning is certainly not a prerequisite for data wrangling, even though the end goal of data wrangling is often to prepare data for a machine learning task downstream.

As a data wrangler, you do not have to know all the nitty-gritty details of your project’s machine learning pipeline. However, it is always a good idea to talk to the machine learning expert who will use your data and to understand the data structure interface and format they need to run the model quickly and accurately.

Myth #4: Deep knowledge of programming is not required for data wrangling

As explained above, the diversity and complexity of data sources require you to be comfortable with fundamental data structures and with how a programming language handles them. Deepening your knowledge of the programming framework (Python, for example) will help you come up with innovative methods for dealing with data source interfacing and data cleaning issues. The speed and efficiency of your data processing pipeline can often benefit from a solid knowledge of basic algorithms, e.g. search, sort, graph traversal, and hash table building. Although built-in methods in standard libraries are optimized, having this knowledge gives you an edge in any situation.
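As a small example of how this knowledge pays off, the sketch below removes duplicate records using a set, which is backed by a hash table, so each membership check takes roughly constant time on average instead of requiring a scan of everything seen so far. The `dedupe` helper and the sample rows are hypothetical and serve only to illustrate the idea.

```python
def dedupe(records, key):
    """Keep the first record for each key value, preserving input order."""
    seen = set()  # hash-based membership test, O(1) on average
    unique = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique


# Hypothetical rows where the same email appears with different formatting.
rows = [
    {"email": "Ana@example.com", "name": "Ana"},
    {"email": "bob@example.com", "name": "Bob"},
    {"email": " ana@example.com", "name": "Ana M."},
]

# Normalize the email before comparing so formatting noise does not hide duplicates.
clean = dedupe(rows, key=lambda r: r["email"].strip().lower())
names = [r["name"] for r in clean]
```

Comparing every record against every other record would take quadratic time as the data grows; tracking seen keys in a set keeps the whole pass linear, which matters once the input reaches millions of rows.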

This was a guest post from Tirthajyoti Sarkar and Shubhadeep Roychowdhury, the authors of Data Wrangling with Python.

We hope that clearing up these misconceptions helps you realize that data wrangling is not as difficult as it seems. Have fun wrangling data!

About the authors

Dr. Tirthajyoti Sarkar works as a Sr. Principal Engineer in the semiconductor technology domain, where he applies cutting-edge data science and machine learning techniques to design automation and predictive analytics.

Shubhadeep Roychowdhury works as a Sr. Software Engineer at a Paris-based cybersecurity startup. He holds a Master's degree in Computer Science from West Bengal University of Technology and certifications in Machine Learning from Stanford.

Don’t forget to check out Data Wrangling with Python to learn the essentials of data wrangling using Python.
