Pandas on Ray is the latest development in the Ray framework. It is a DataFrame library that wraps Pandas and provides a transparent distribution of data and computation. Pandas on Ray is targeted towards existing Pandas users who are looking to improve performance and see faster runtimes without having to switch to another API. It accelerates Pandas queries by 4 times on an 8-core machine. This requires users to change just a single line of code in their notebooks.
Developed by two Ph.D.students, Philipp Moritz and Robert Nishihara, at the RISELab, Ray is a a distributed execution framework for AI applications and also a potential project to replace Apache Spark. RISELab is the successor to the U.C.Berkeley group, which created Apache Spark.
Apache Spark was designed to be faster than its forerunner, MapReduce, but still faced issues with design decisions which made it difficult to write applications that included Complex task dependencies. This was mainly because of Spark’s internal synchronization mechanisms. Ray was designed to provide better speeds than Apache Spark.
Ray, is designed to provide better speeds than even Apache Spark. It is written in C++ and aims at accelerating the execution of machine learning algorithms developed in Python. It makes use of an immutable object model–any objects that can be made immutable don’t need to be synchronized across the cluster–which save a lot of time. Also, Ray maintains a state of computation among various other nodes in the cluster, which in turn maximizes robustness.
Additional features include:
On comparing Pandas with Pandas on Ray, following results were obtained:
Pandas on Ray:
100 loops, best of 3: 4.14 ms per loop
Pandas:
The slowest run took 32.21 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 17.3 ms per loop
This concluded, Pandas on Ray is about 4 times faster than Pandas. This was run on a machine with eight cores, so the speedup isn’t perfect because of the overheads.
Here, no special optimizations were done for Pandas on Ray; only the default settings were used in this experimentation. Also, Ray uses Eager execution and thus one cannot have query planning or have advanced knowledge of the best way to compute a given workflow.
To know more about Ray in detail, visit its GitHub repository. Also, to more about Pandas on Ray at the RISELab blog.
I remember deciding to pursue my first IT certification, the CompTIA A+. I had signed…
Key takeaways The transformer architecture has proved to be revolutionary in outperforming the classical RNN…
Once we learn how to deploy an Ubuntu server, how to manage users, and how…
Key-takeaways: Clean code isn’t just a nice thing to have or a luxury in software projects; it's a necessity. If we…
While developing a web application, or setting dynamic pages and meta tags we need to deal with…
Software architecture is one of the most discussed topics in the software industry today, and…