Andy Grove, a software engineer, has introduced Ballista, a distributed compute platform, and explained his journey on the project in a recent blog post. About eighteen months ago, he started the DataFusion project, an in-memory query engine that uses Apache Arrow as its memory model. The aim was to build a distributed compute platform in Rust that could compete with Apache Spark, but that turned out to be difficult.
Grove writes in a blog post, “Unsurprisingly, this turned out to be an overly ambitious goal at the time and I fell short of achieving that. However, some very good things came out of this effort. We now have a Rust implementation of Apache Arrow with a growing community of committers, and DataFusion was donated to the Apache Arrow project as an in-memory query execution engine and is now starting to see some early adoption.”
He then took a break from working on Arrow and DataFusion for a couple of months and focused on some deliverables at work.
Next, he started a new proof-of-concept (PoC) project, his second attempt at building a distributed platform in Rust. This time, he had the advantage of already having Arrow and DataFusion at his disposal.
A Ballista cluster currently consists of a number of individual pods within a Kubernetes cluster, and it can be created and destroyed via the Ballista CLI. Ballista applications can be deployed to Kubernetes with the Ballista CLI, and they use Kubernetes service discovery to connect to the cluster. Because there is no distributed query planner yet, Ballista applications must manually build the query plans to be executed on the cluster.
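To give a rough sense of what building a query plan by hand involves, here is a minimal Rust sketch. It does not use the Ballista or DataFusion APIs: the `Plan` enum, the `plan_for_partition` function, and the column names `city` and `fare` are all hypothetical and exist only for illustration. The idea it shows is that, without a planner, the application itself has to assemble the tree of operators (scan, then filter, then aggregate) for each piece of data it wants processed.

```rust
// Hypothetical plan type for illustration only; this is NOT the Ballista
// or DataFusion API.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Read a file and keep only the named columns.
    Scan { path: String, columns: Vec<String> },
    /// Keep rows matching a predicate (held as text to keep the sketch small).
    Filter { predicate: String, input: Box<Plan> },
    /// Group rows by some columns and compute aggregate expressions.
    Aggregate {
        group_by: Vec<String>,
        aggregates: Vec<String>,
        input: Box<Plan>,
    },
}

impl Plan {
    /// Render the plan as an indented tree, root operator first.
    fn describe(&self) -> String {
        let mut out = String::new();
        self.write_tree(0, &mut out);
        out
    }

    fn write_tree(&self, depth: usize, out: &mut String) {
        let indent = "  ".repeat(depth);
        match self {
            Plan::Scan { path, columns } => {
                out.push_str(&format!(
                    "{}Scan: {} [{}]\n",
                    indent,
                    path,
                    columns.join(", ")
                ));
            }
            Plan::Filter { predicate, input } => {
                out.push_str(&format!("{}Filter: {}\n", indent, predicate));
                input.write_tree(depth + 1, out);
            }
            Plan::Aggregate { group_by, aggregates, input } => {
                out.push_str(&format!(
                    "{}Aggregate: group by [{}], compute [{}]\n",
                    indent,
                    group_by.join(", "),
                    aggregates.join(", ")
                ));
                input.write_tree(depth + 1, out);
            }
        }
    }
}

/// Assemble, by hand, the plan an application might build for one data file.
/// Column names and the predicate are made up for this example.
fn plan_for_partition(path: &str) -> Plan {
    let scan = Plan::Scan {
        path: path.to_string(),
        columns: vec!["city".to_string(), "fare".to_string()],
    };
    let filter = Plan::Filter {
        predicate: "fare > 0".to_string(),
        input: Box::new(scan),
    };
    Plan::Aggregate {
        group_by: vec!["city".to_string()],
        aggregates: vec!["MAX(fare)".to_string()],
        input: Box::new(filter),
    }
}
```

Calling `plan_for_partition` once per data file and printing `describe()` on each result shows the operator tree the application would have to build itself. A distributed query planner, once available, would presumably derive such trees automatically.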
To make the project practical and take it beyond a PoC, Grove listed some of the items on the roadmap for v1.0.0:
- Implement a distributed query planner.
- Support all DataFusion logical plans and expressions.
- Support user code as part of distributed query execution.
- Support interactive SQL queries against a cluster with gRPC.
- Support the Arrow Flight protocol and Java bindings.
This PoC project will help drive the requirements for DataFusion, and it has already led to three DataFusion PRs that are being merged into the Apache Arrow codebase.
Reactions to the initiative appear to be mixed. One user commented on Hacker News, “Hang in there mate 🙂 I really don’t think you deserve a lot of the crap you’ve been given in this thread. Someone has to try something new.”
Another user commented, “The fact people opposed to your idea/work means it is valuable enough for people to say something against and not ignore it.”
To learn more, check out the official announcement.