Home Data News Pandas on Ray: Make Pandas faster by replacing one line of your...

Pandas on Ray: Make Pandas faster by replacing one line of your code

March 6, 2018 - 12:00 am

2651

2 min read

Pandas on Ray is the latest development in the Ray framework. It is a DataFrame library that wraps Pandas and provides a transparent distribution of data and computation. Pandas on Ray is targeted towards existing Pandas users who are looking to improve performance and see faster runtimes without having to switch to another API. It accelerates Pandas queries by 4 times on an 8-core machine. This requires users to change just a single line of code in their notebooks.

Ray: A machine learning substitute for Apache Spark

Developed by two Ph.D.students, Philipp Moritz and Robert Nishihara, at the RISELab, Ray is a a distributed execution framework for AI applications and also a potential project to replace Apache Spark. RISELab is the successor to the U.C.Berkeley group, which created Apache Spark.

Apache Spark was designed to be faster than its forerunner, MapReduce, but still faced issues with design decisions which made it difficult to write applications that included Complex task dependencies. This was mainly because of Spark’s internal synchronization mechanisms. Ray was designed to provide better speeds than Apache Spark.

Ray, is designed to provide better speeds than even Apache Spark. It is written in C++ and aims at accelerating the execution of machine learning algorithms developed in Python. It makes use of an immutable object model–any objects that can be made immutable don’t need to be synchronized across the cluster–which save a lot of time. Also, Ray maintains a state of computation among various other nodes in the cluster, which in turn maximizes robustness.

Additional features include:

Ray can handle heterogeneous hardware (where some application workload is being executed on CPUs and others on GPUs) as it has a number of schedulers that can bring both CPUs and GPUs together.
It can also borrow task-dependency attributes from MPI, the low-level distributed programming environment.
Ray is also useful for building an array of applications that require fast decision-making on real-world data such as what’s required for autonomous driving and so on.

Pandas on Ray

On comparing Pandas with Pandas on Ray, following results were obtained:

Pandas on Ray:

100 loops, best of 3: 4.14 ms per loop

Pandas:

The slowest run took 32.21 times longer than the fastest. This could mean that an intermediate result is being cached.

1 loop, best of 3: 17.3 ms per loop

This concluded, Pandas on Ray is about 4 times faster than Pandas. This was run on a machine with eight cores, so the speedup isn’t perfect because of the overheads.

Here, no special optimizations were done for Pandas on Ray; only the default settings were used in this experimentation. Also, Ray uses Eager execution and thus one cannot have query planning or have advanced knowledge of the best way to compute a given workflow.

To know more about Ray in detail, visit its GitHub repository. Also, to more about Pandas on Ray at the RISELab blog.

Top 6 Cybersecurity Books from Packt to Accelerate Your Career

Your Quick Introduction to Extended Events in Analysis Services from Blog…

Logging the history of my past SQL Saturday presentations from Blog…

Storage savings with Table Compression from Blog Posts – SQLServerCentral

Daily Coping 31 Dec 2020 from Blog Posts – SQLServerCentral

Learning Essential Linux Commands for Navigating the Shell Effectively

Exploring the Strategy Behavioral Design Pattern in Node.js

How to integrate a Medium editor in Angular 8

Implementing memory management with Golang’s garbage collector

How to create sales analysis app in Qlik Sense using DAR…

Pandas on Ray: Make Pandas faster by replacing one line of your code

Ray: A machine learning substitute for Apache Spark

Pandas on Ray

LEAVE A REPLY Cancel reply

MobilePro

datapro

Programming

Subscribe to our newsletter