
Qubole has announced the availability of a working implementation of Apache Spark on AWS Lambda. The big data-as-a-service company said the prototype has successfully scanned 1 TB of data and sorted 100 GB of data from Amazon Simple Storage Service (S3).

Qubole said the ability to run Spark on Lambda, a serverless compute service that allows users to only pay for the compute power they use without needing to provision servers, makes the platform more elastic and efficient with its resource usage.

Running Spark on AWS Lambda was previously a challenge, mainly because Spark could not communicate directly with Lambda, something it needs to do in order to run its executors. In addition, Lambda’s limited runtime resources (a maximum execution duration of five minutes, 1,536 MB of memory and 512 MB of disk space) make it extremely difficult for a memory-hungry platform like Spark to run.

The Spark on Lambda service overcomes both limitations. Qubole said it performed some technical wizardry so that Spark executors run from within an AWS Lambda invocation, thereby sidestepping the communication issues. It dealt with Lambda’s limited runtime resources by using external storage to avoid the local disk size limit.
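To make the limits above concrete, here is a minimal Python sketch of the kind of check a scheduler might make. It is not Qubole's implementation: the `TaskEstimate` class and `plan_task` function are hypothetical names invented for illustration, and the only figures taken from the article are the five-minute, 1,536 MB and 512 MB limits.

```python
from dataclasses import dataclass

# Lambda limits as quoted in the article.
MAX_DURATION_S = 5 * 60      # maximum execution duration
MAX_MEMORY_MB = 1536         # maximum memory
MAX_LOCAL_DISK_MB = 512      # maximum local disk space


@dataclass
class TaskEstimate:
    """Hypothetical estimate of what one Spark task needs."""
    runtime_s: float
    memory_mb: float
    shuffle_mb: float


def plan_task(task: TaskEstimate) -> dict:
    """Decide whether a task fits in one invocation and where its
    intermediate (shuffle) data should be written."""
    # Duration and memory are hard limits: a task exceeding either
    # cannot run inside a single Lambda invocation.
    fits = (task.runtime_s <= MAX_DURATION_S
            and task.memory_mb <= MAX_MEMORY_MB)
    # Disk is the limit that external storage sidesteps: anything
    # larger than local disk is routed to external storage instead.
    shuffle = "local" if task.shuffle_mb <= MAX_LOCAL_DISK_MB else "external"
    return {"fits": fits, "shuffle": shuffle}


if __name__ == "__main__":
    # A two-minute task with 1 GB of memory and 2 GB of shuffle data.
    print(plan_task(TaskEstimate(runtime_s=120, memory_mb=1024, shuffle_mb=2048)))
```

In this sketch, a task that stays within the time and memory limits can still produce more intermediate data than the local disk holds, which is the case external storage is meant to handle.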

Spark on Lambda’s elasticity works perfectly for a number of use cases, including:

  • Interactive and ad-hoc data analysis where compute on demand is critical.
  • ETL transformation of clickstream data, access logs or even data science workloads. The necessary data pre-processing and preparation can fit perfectly into AWS Lambda runtimes.
  • Streaming applications with a discrete flow of events and varying queue length, which are perfect candidates for Spark on Lambda’s elasticity.

“Qubole customers run some of the largest Spark clusters in the world. We wanted to show that a complex technology like Spark can be implemented on a serverless compute infrastructure like Lambda and scale efficiently,” Qubole CEO Ashish Thusoo said. “Spark on Lambda can eliminate most of the operational complexities of running Spark clusters, handle bursty workloads more effectively and be more cost efficient.”

Qubole said Spark on Lambda is currently available as a technology preview, and the company will demonstrate its capabilities during the AWS re:Invent 2017 conference in Las Vegas at Sands Expo booth 834 and Aria booth 201. The code is available on GitHub at https://github.com/qubole/spark-on-lambda.

Writes and reports on Information Technology. Full stack on artificial intelligence, data science, and music.

