With businesses generating data at an enormous rate today, many Big Data processing alternatives such as Apache Hadoop, Spark, Flink, and more have emerged in the last few years. Among them, Apache Spark has gained a lot of popularity of late, as it offers ease of use and sophisticated analytics, and helps you process data with speed and efficiency.
Romeo Kienzler, Chief Data Scientist in the IBM Watson IoT worldwide team, has been helping clients all over the world find insights from their IoT data using Apache Spark. An Associate Professor for Artificial Intelligence at Swiss University of Applied Sciences, Berne, he is also a member of the IBM Technical Expert Council and the IBM Academy of Technology, IBM’s leading brains trust.
In this interview, Romeo talks about his new book on Apache Spark and Spark’s evolution from just a data processing framework to becoming a solid, all-encompassing platform for real-time processing, streaming analytics and distributed Machine Learning.
I’ve shifted completely away from non-scalable environments like R and Python pandas. I’ve also shifted away from Scala for prototyping. I’m using Scala only for mission-critical applications which have to be maintained for the long term. Otherwise, I’m using Python. I’m trying to stay completely on Apache Spark for everything I’m doing, which is feasible since Spark supports:
The advantage is that everything I’m doing is scalable by definition and, once I need it, I can scale without changing code.
Scaling on very large clusters is still tricky with Apache Spark because, at a certain point, scale-out is not linear anymore. So, a lot of tweaking of the various knobs is necessary. Also, the Spark API is slightly more tedious than that of R or Python pandas, so it takes some energy to really stick with it and not go back to “the good old RStudio”.
Next, I think the strategic shift from RDDs to DataFrames and Datasets was a disruptive but necessary step. In the book, I try to justify this step and first explain how the new API and the two related projects, Tungsten and Catalyst, work. Then I show how things like machine learning, streaming, and graph processing are done in the traditional, RDD-based way as well as in the new DataFrames- and Datasets-based way.
The most significant change, in my opinion, was labeling Structured Streaming as GA rather than experimental. Otherwise, there have been “only” minor improvements, mainly on performance: 72 to be precise, all documented in JIRA since it is an Apache project. The most significant improvement between versions 1.6 and 2.0 was whole-stage code generation in Tungsten, which is also covered in this book.
Actually, Apache Spark takes it to the next level by introducing the concept of continuous applications. With Apache Spark, the streaming and batch APIs have been unified, so you no longer have to care what type of data you are running your queries on. You can even mix and match, for example by joining a structured stream, a relational database, a NoSQL database and a file in HDFS within a single SQL statement. Everything is possible.
Back in 2015, Apache Spark was just another framework within the Hadoop ecosystem. Now, Apache Spark has grown to be one of the largest open source projects on this planet! Apache Spark is the new big data operating system, like Hadoop was back in 2015. AI and Deep Learning are the most important trends and, as explained in this book, frameworks like H2O, DeepLearning4J and Apache SystemML are using Apache Spark as their big data operating system to scale.
I think I’ve done a very good job in taking real-life examples from my work and finding a good open data source or writing a good simulator to give hands-on experience in solving real-world problems. So in the book, you should find a recipe for all the current data science problems you find in the industry.
This partnership underpins IBM’s strong commitment to open source. Not only is IBM contributing to Apache Spark, IBM also creates new open source projects on top of it. The most prominent example is Apache SystemML, which is also covered in this book. The next three years are dedicated to deep learning and AI, and IBM’s open source contributions will help the Apache Spark community to succeed. The most prominent example is PowerAI, where IBM outperformed all state-of-the-art deep learning technologies for image recognition.
I suggest taking a Machine Learning course from one of the leading online training vendors. Then take a Spark course (or read my book). Finally, try to do everything yourself. Participate in Kaggle competitions and try to replicate papers.