Categories: NewsData

Pandas on Ray: Make Pandas faster by replacing one line of your code

2 min read

Pandas on Ray is the latest development in the Ray framework. It is a DataFrame library that wraps Pandas and provides a transparent distribution of data and computation. Pandas on Ray is targeted towards existing Pandas users who are looking to improve performance and see faster runtimes without having to switch to another API. It accelerates Pandas queries by 4 times on an 8-core machine. This requires users to change just a single line of code in their notebooks.

Ray: A machine learning substitute for Apache Spark

Developed by two Ph.D.students, Philipp Moritz and Robert Nishihara, at the RISELab, Ray is a a distributed execution framework for AI applications and also a potential project to replace Apache Spark. RISELab is the successor to the U.C.Berkeley group, which created Apache Spark.

Apache Spark was designed to be faster than its forerunner, MapReduce, but still faced issues with design decisions which made it difficult to write applications that included Complex task dependencies. This was mainly because of Spark’s internal synchronization mechanisms. Ray was designed to provide better speeds than Apache Spark.

Ray, is designed to provide better speeds than even Apache Spark. It is written in C++ and aims at accelerating the execution of machine learning algorithms developed in Python. It makes use of an immutable object model–any objects that can be made immutable don’t need to be synchronized across the cluster–which save a lot of time. Also, Ray maintains a state of computation among various other nodes in the cluster, which in turn maximizes robustness.

Additional features include:

Ray can handle heterogeneous hardware (where some application workload is being executed on CPUs and others on GPUs) as it has a number of schedulers that can bring both CPUs and GPUs together.
It can also borrow task-dependency attributes from MPI, the low-level distributed programming environment.
Ray is also useful for building an array of applications that require fast decision-making on real-world data such as what’s required for autonomous driving and so on.

Pandas on Ray

On comparing Pandas with Pandas on Ray, following results were obtained:

Pandas on Ray:

100 loops, best of 3: 4.14 ms per loop

Pandas:

The slowest run took 32.21 times longer than the fastest. This could mean that an intermediate result is being cached.

1 loop, best of 3: 17.3 ms per loop

This concluded, Pandas on Ray is about 4 times faster than Pandas. This was run on a machine with eight cores, so the speedup isn’t perfect because of the overheads.

Here, no special optimizations were done for Pandas on Ray; only the default settings were used in this experimentation. Also, Ray uses Eager execution and thus one cannot have query planning or have advanced knowledge of the best way to compute a given workflow.

To know more about Ray in detail, visit its GitHub repository. Also, to more about Pandas on Ray at the RISELab blog.

Savia Lobo

A Data science fanatic. Loves to be updated with the tech happenings around the globe. Loves singing and composing songs. Believes in putting the art in smart.