Data is the new oil, and it is just as crude as unrefined oil. To do anything meaningful with it – modeling, visualization, machine learning, predictive analysis – you first need to wrestle and wrangle with the data. We recently interviewed Dr. Tirthajyoti Sarkar and Shubhadeep Roychowdhury, the authors of the course Data Wrangling with Python. They talked about their new course and discussed why data wrangling matters and why Python is a good choice for it.
Key Takeaways
- Python boasts a large, broad standard library with a rich set of modules and functions, which you can use to manipulate complex data structures.
- NumPy, the Python library for fast numeric array computations, and Pandas, a package with fast, flexible, and expressive data structures, are helpful in working with “relational” or “labeled” data.
- Web scraping or data extraction becomes easy and intuitive with Python libraries, such as BeautifulSoup4 and html5lib.
- Regex, the tiny, highly specialized pattern language available inside Python, lets you create patterns that help match, locate, and manage text in large data analysis and search operations.
- Present interesting, interactive visuals of your data with Matplotlib, the most popular graphing and data visualization module for Python.
- Easily and quickly separate information from a huge amount of random data using Pandas, the preferred Python tool for data wrangling and modeling.
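To make the regex takeaway concrete, here is a minimal sketch using Python's built-in `re` module. The log lines and field names (`user`, `amount`) are invented for illustration and do not come from the course.

```python
import re

# Hypothetical log lines, invented for illustration.
lines = [
    "2019-02-11 user=alice amount=12.50",
    "2019-02-12 user=bob amount=7",
    "malformed line without fields",
]

# Named groups: an ISO-style date, a user name, and a numeric amount.
pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"user=(?P<user>\w+)\s+"
    r"amount=(?P<amount>\d+(?:\.\d+)?)"
)

records = []
for line in lines:
    match = pattern.search(line)
    if match is None:
        continue  # skip lines that do not fit the pattern
    row = match.groupdict()
    row["amount"] = float(row["amount"])  # convert the matched text to a number
    records.append(row)
```

Lines that do not match are skipped rather than raising an error, which is one reasonable choice when the input is messy; logging or counting the rejects would be another.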
Full Interview
Congratulations on your new course ‘Data Wrangling with Python’. What is this course all about?
Data science is the ‘sexiest job of the 21st century’ (at least until Skynet takes over the world). But for all the emphasis on ‘Data’, it is the ‘Science’ that makes you – the practitioner – valuable. To practice high-quality science with data, you first need to make sure it is properly sourced, cleaned, formatted, and pre-processed.
This course teaches you the most essential basics of this invaluable component of the data science pipeline – data wrangling.
What is data wrangling and why should you learn it well?
“Data is the new Oil” and it is ruling the modern way of life through incredibly smart tools and transformative technologies. But oil from the rig is far from being usable. It has to be refined through a complex processing network.
Similarly, data needs to be curated, massaged, and refined to become fit for use in intelligent algorithms and consumer products. This is called “wrangling” and, according to CrowdFlower, good data scientists spend roughly 60-80% of their time on it, every day and on every project.
It generally involves the following:
- Scraping the raw data from multiple sources (including web and database tables),
- Imputing, formatting, transforming – basically making it ready for use in the modeling process (e.g. advanced machine learning),
- Handling missing data gracefully,
- Detecting outliers, and
- Being able to perform quick visualizations (plotting) and basic statistical analysis to judge the quality of your formatted data
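Some of the steps above can be sketched in a few lines of Pandas. This is a minimal illustration rather than the course's own code: the data is invented, median imputation is just one way to handle missing values, and the interquartile-range (IQR) rule is just one common outlier test.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value and one suspicious reading.
raw = pd.DataFrame(
    {
        "city": ["Paris", "Pune", "Austin", "Lyon", "Delhi", "Oslo"],
        "temp_c": [21.0, 23.0, np.nan, 22.0, 20.0, 95.0],
    }
)

# Handle missing data: impute with the column median (one of several options).
median_temp = raw["temp_c"].median()
clean = raw.assign(temp_c=raw["temp_c"].fillna(median_temp))

# Detect outliers with the interquartile-range (IQR) rule.
q1, q3 = clean["temp_c"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (clean["temp_c"] < q1 - 1.5 * iqr) | (clean["temp_c"] > q3 + 1.5 * iqr)
outliers = clean[is_outlier]
wrangled = clean[~is_outlier].reset_index(drop=True)

# Quick statistical summary to judge the quality of the result.
summary = wrangled["temp_c"].describe()
```

For the quick-visualization step, a call such as `wrangled["temp_c"].plot.hist()` would draw a histogram, assuming Matplotlib is installed.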
This course aims to teach you all the core ideas behind this process and to equip you with knowledge of the most popular tools and techniques in the domain. As the programming framework, we have chosen Python, the most widely used language for data science. We work through real-life examples, and by the end of this course you will be confident handling a myriad of sources to extract, clean, transform, and format your data for further analysis or exciting machine learning model building.
Walk us through your thinking behind how you went about designing this course. What’s the flow like? How do you teach data wrangling in this course?
The lessons start with a refresher on Python, focusing mainly on advanced data structures, and then quickly jump into the NumPy and Pandas libraries as fundamental tools for data wrangling.
It emphasizes why you should stay away from traditional ways of data cleaning, as done in other languages, and take advantage of specialized pre-built routines in Python.
Thereafter, it covers how, using the same Python backend, one can extract and transform data from a diverse array of sources – the internet, large database vaults, or Excel financial tables.
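As a rough sketch of what "the same Python backend" can look like, the example below loads data from two different kinds of source into Pandas and combines them. The table name, columns, and values are hypothetical; CSV text stands in for a spreadsheet export and an in-memory SQLite database stands in for a larger database. Web pages and Excel files are not shown here.

```python
import io
import sqlite3

import pandas as pd

# Source 1: CSV text, standing in for a file exported from a spreadsheet.
csv_text = "order_id,amount\n1,10.5\n2,20.0\n"
orders_csv = pd.read_csv(io.StringIO(csv_text))

# Source 2: a database table, here an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(3, 5.25), (4, 8.0)])
conn.commit()
orders_db = pd.read_sql_query("SELECT order_id, amount FROM orders", conn)
conn.close()

# Both sources now share one representation and can be combined.
orders = pd.concat([orders_csv, orders_db], ignore_index=True)
total = orders["amount"].sum()
```

Once everything is a DataFrame, the same cleaning and transformation code applies regardless of where the rows came from.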
Further lessons teach how to handle missing or wrong data, and reformat based on the requirement from a downstream analytics tool.
The course emphasizes learning by real example and showcases the power of an inquisitive and imaginative mind primed for success.
What other tools are out there? Why do data wrangling with Python?
First, let us be clear that there is no substitute for the data wrangling process itself. There is no shortcut either. Data wrangling must be performed before the modeling task, but there is always a debate about whether to do it with an enterprise tool or directly with a programming language and its associated frameworks. There are many commercial, enterprise-level tools for data formatting and pre-processing which do not involve coding on the part of the user.
Common examples of such tools are:
- General purpose data analysis platforms such as Microsoft Excel (with add-ins)
- Statistical discovery packages such as JMP (from SAS)
- Modeling platforms such as RapidMiner
- Analytics platforms from niche players focusing on data wrangling, such as Trifacta, Paxata, and Alteryx
At the end of the day, whether to use any of these off-the-shelf tools or to gain more flexibility, control, and power by using a programming language like Python for data wrangling depends on the organizational approach. As the volume, velocity, and variety (the three V’s of Big Data) of data undergo rapid changes, it is always a good idea to develop and nurture a significant amount of in-house expertise in data wrangling. Building that expertise on fundamental programming frameworks means an organization is not beholden to the whims and fancies of any particular enterprise platform for a task as basic as data wrangling.
Some of the obvious advantages of using an open-source, free programming paradigm like Python for data wrangling are:
- General purpose open-source paradigm putting no restriction on any of the methods you can develop for the specific problem at hand
- Great eco-system of fast, optimized, open-source libraries, focused on data analytics
- Growing support for connecting Python to every conceivable type of data source
- Easy interface to basic statistical testing and quick visualization libraries to check data quality
- Seamless interface from the data wrangling output to advanced machine learning models – Python is the most popular language for machine learning/artificial intelligence these days.
What are some best practices to perform data wrangling with Python?
Here are five best practices that will help you in your data wrangling journey with Python. Follow them, and you will end up with clean, ready-to-use data for your business needs.
- Learn the data structures in Python really well
- Learn and practice file and OS handling in Python
- Have a solid understanding of core data types and capabilities of NumPy and Pandas
- Build a good understanding of basic statistical tests and a panache for visualization
- Apart from Python, if you want to master one language, go for SQL
What are some misconceptions about data wrangling?
Though data wrangling is an important task, there are certain myths associated with data wrangling which developers should be cautious of. Myths such as:
- Data wrangling is all about writing SQL queries
- Knowledge of statistics is not required for data wrangling
- You have to be a machine learning expert to do great data wrangling
- Deep knowledge of programming is not required for data wrangling
We hope that clearing up these misconceptions helps you realize that data wrangling is not as difficult as it seems. Have fun wrangling data!
About the authors
Dr. Tirthajyoti Sarkar works as a Sr. Principal Engineer in the semiconductor technology domain where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics.
Shubhadeep Roychowdhury works as a Sr. Software Engineer at a Paris-based cybersecurity startup. He holds a Master's degree in Computer Science from West Bengal University of Technology and certifications in Machine Learning from Stanford.