Python’s rise as the preferred language of choice in Data Science is unprecedented, but not really unexpected. Apart from being a general-purpose language which can be used for a variety of tasks – from scripting to networking, Python offers a rich suite of libraries for general data science tasks such as scientific computing, data visualization, and more. However, one big challenge faced by the data scientists is that these packages are not designed for scale. This is crucial in today’s Big Data era where tons of data needs to be processed and analyzed on the go. A platform which supports the existing Python ecosystem and allows it to scale across multiple machines and clusters without affecting the performance was conspicuously missing. Enter Dask.
What is Dask?
Dask is a flexible parallel computing library written in Python for analytics, designed mainly to offer scalability and enhanced power to the existing packages and libraries. It allows the users to integrate their existing Python-based projects written in popular libraries such as NumPy, SciPy, pandas, and more.
Architecture is demonstrated in the diagram below:
Architecture (Image courtesy: Slideshare)
The 2 key components of Dask that interact with the Python libraries are:
- Dynamic task schedulers – which takes care of the intensive computational workloads
- ‘Big Data’ Dask collections – consisting of dataframes, parallel arrays and interfaces that allow for the computations to run on distributed environments
Why use Dask?
Given there are already quite a few distributed platforms for large-scale data processing such as Apache Spark, Apache Storm, Flink and so on, why and when should one go for Dask? What are the advantages offered by this Python library? Let us take a look at the 4 major reasons to prefer Dask for distributed, scalable analytics in Python:
Easy to get started:
If you are an existing Python user, you must have already worked with popular Python packages such as NumPy, SciPy, matplotlib, scikit-learn, pandas, and more. Dask offers a similar, intuitive interface and since it is a part of the bigger Python ecosystem, getting started with Dask is very easy. It uses the existing Python APIs to switch between the popular packages and their Dask-equivalents, so you don’t have to spend a lot of time in porting the code.
For absolute beginners, using Dask for scalable analytics would be an easier and logical option to pursue, once they have grasped the fundamentals of Python and the associated libraries.
Scales up and down quite easily:
You can run your project on Dask on a single machine, or on a cluster with thousands of cores without essentially affecting the speed and performance of your code. Dask uses the multi-core CPUs within a single system optimally to process hundreds of terabytes of data without the need for additional hardware. Similarly, for moderate to large datasets spanning 100+ gigabytes which often don’t fit into a single storage device, the computing power of the clusters can be coupled with Dask for effective analytics.
Supports complex applications:
Many companies tend to tackle complex computations by introducing custom codes that run on popular Big Data tools such as Hadoop MapReduce and Apache Spark. However, with the help of the dynamic task schedule feature of Dask, it is now possible to run and process complex applications without introducing any additional code. Dask is solely responsible for the smooth handling of various tasks such as network communication, load balancing and diagnostics, among the others.
Clear, responsive, real-time feedback:
One of the most important features of Dask is its user-friendliness. Dask provides a real-time dashboard that highlights the key metrics of the processing task undertaken by the user – such as the current progress of your project, memory consumption and more. It also offers an in-built IPython kernel that allows the user to investigate the ongoing computation with just a terminal.
How Dask compares with Apache Spark
Apache Spark is one of the most popular and widely used Big Data tools for distributed data processing and analytics. Dask and Apache Spark have many features in common, prompting us and many other developers to ask the question – which tool is better? While Spark has been around for quite some and has many standard, stable features over years of development, Dask is quite new and is still being improved as a tool. We summarize the important differences between Dask and Apache Spark in the table below:
CriteriaApache SparkDaskPrimary languageScalaPythonScaleSupports a single node to thousands of nodes in the clusterSupports a single node to thousands of nodes in the clusterEcosystemAll-in-one self-sufficient ecosystemIntegration with popular libraries within the Python ecosystemFlexibilityLowHighStream processingBuilt-in module called Spark Streaming presentReal-time interface which is pretty low-level, requires more work than Apache SparkGraph processingPossible with GraphX moduleNot possible Machine learningUses the Spark MLlib moduleIntegrates with scikit-learn and XGBoostPopularityVery high, commonly used tool in the Big Data ecosystemFairly new tool but has already found its place in the pandas, scikit-learn and Jupyter stack
You can read a detailed comparison of Apache Spark and Dask on the official Dask documentation page.
What we can expect from Dask
As we saw from the comparison above, it is fairly easy to port an existing Python project using several high-profile Python libraries such as NumPy, scikit-learn and more. Python developers and data scientists will appreciate the high flexibility and complex computational capabilities offered by Dask. The limited stream processing and graph processing features are big areas of improvement, but we can expect some developments in this domain in the near future.
Even though Dask is still relatively new, it looks very promising due to its close affinity with the Python ecosystem. With Python’s clout rising, many people would prefer a Python-based data processing tool which works at scale, without having to switch to an external Big Data framework. Dask may well be the superhero to come to the developers’ rescue, in such cases.
You can learn more about the latest developments in Dask on their official GitHub page.