With businesses generating data at an enormous rate today, many Big Data processing alternatives such as Apache Hadoop, Spark, Flink, and more have emerged in the last few years. Apache Spark among them has gained a lot of popularity of late, as it offers ease of use and sophisticated analytics, and helps you process data with speed and efficiency.
In this interview, Romeo talks about his new book on Apache Spark and Spark’s evolution from just a data processing framework to becoming a solid, all-encompassing platform for real-time processing, streaming analytics and distributed Machine Learning.
- Apache Spark has evolved to become a full-fledged platform for real-time batch processing and stream processing.
- Its in-memory computing capabilities allow for efficient streaming analytics, graph processing, and machine learning.
- It gives you the ability to work with your data at scale, without worrying if it is structured or unstructured.
- Popular frameworks like H2O and DeepLearning4J are using Apache Spark as their preferred platform for distributed AI, Machine Learning, and Deep Learning.
As a data scientist and an assistant professor, you must have used many tools both for your work and for research? What are some key criteria one must evaluate while choosing a big data analytics solution? What are your go-to tools and where does Spark rank among them?
- Scalability. Make sure you can use a cluster to accelerate execution of your processes
- TCO – How much do I have to pay for licensing and deployment. Consider the usage of Open Source (but keep maintenance in mind). Also, consider Cloud.
I’ve shifted completely away from non-scalable environments like R and python pandas. I’ve also shifted away from scala for prototyping. I’m using scala only for mission-critical applications which have to be maintained for the long term. Otherwise, I’m using python. I’m trying to completely stay on Apache Spark for everything I’m doing which is feasible since Spark supports:
- Machine Learning
The advantage is that everything I’m doing is scalable by definition and once I need it I can scale without changing code.
What does the road to mastering Apache Spark look like? What are some things that users may not have known about Apache Spark? Can readers look forward to learning about some of them in your new book: Mastering Apache Spark, second edition?
Scaling on very large clusters is still tricky with Apache Spark because at a certain point scale-out is not linear anymore. So, a lot of tweaking of the various knobs is necessary. Also, the Spark API somehow is slightly more tedious that the one of R or python Pandas – so it needs some energy to really stick with it and not to go back to “the good old R-Studio”.
Next, I think the strategic shift from RDDs to DataFrames and Datasets was a disrupting but necessary step. In the book, I try to justify this step and first explain how the new API and the two related projects Tungsten and Catalyst work. Then I show how things like machine learning, streaming, and graph processing are done in the traditional, RDD based way as well as in the new DataFrames and Datasets based way.
What are the top 3 data analysis challenges that never seem to go away even as time and technology keep changing? How does Spark help alleviate them?
- Data quality. Data is often noisy and in bad formats. The majority of the time I spend improving it through various methodologies. Apache Spark helps me to scale. SparkSQL and SparkML pipelines introduce a standardized framework for doing so.
- Unstructured data preparation. A lot of data is unstructured in the form of text. Apache Spark allows me to pre-process vast amount of text and create tiny mathematical representations out of it for downstream analysis.
- Instability on technology. Every six months there is a new hype which seems to make everything you’ve learned redundant. So, for example, there exist various scripting languages for big data. SparkSQL ensures that I can use my already acquired SQL skills now and in future.
How is the latest Apache Spark 2.2.0 a significant improvement over the previous version?
The most significant change, in my opinion, was labeling Structured Streaming GA and no longer as experimental. Otherwise, there have been “only” minor improvements, mainly on performance, 72 to be precise as all are documented in JIRA since it is an Apache project. The most significant improvement between version 1.6 to 2.0 was whole stage code generation in Tungsten which is also covered in this book.
Streaming analytics has become mainstream. What role did Apache Spark play in leading this trend?
Actually, Apache Spark takes it to the next level by introducing the concept of continuous applications. So with Apache Spark, the streaming and batch API have been unified that you actually don’t have to care anymore on what type of data you are running your queries on. You can even mix and match. For example joining a structured stream, a relational database, a NoSQL database and a file in HDFS within a single SQL statement. Everything is possible.
Mastering Apache Spark was first published back in 2015. Big data has greatly evolved since then. What does the second edition of Mastering Apache Spark offer readers today in this context?
Back in 2015, Apache Spark was just another framework within the Hadoop ecosystem. Now, Apache Spark has grown to be one of the largest open source projects on this planet! Apache Spark is the new big data operating system like Hadoop was back in 2015. AI and Deep Learning are the most important trends and as explained in this book, Frameworks like H2O, DeepLearning4J and Apache SystemML are using Apache Spark as their big data operation system to scale.
I think I’ve done a very good job in taking real-life examples from my work and finding a good open data source or writing a good simulator to give hands-on experience in solving real-world problems. So in the book, you should find a recipe for all the current data science problems you find in the industry.
2015 was also the year when Apache Spark and IBM Watson chose to join hands. As the Chief data scientist for IBM Watson IoT, give us a glimpse of what this partnership is set to achieve.
This partnership underpins IBM’s strong commitment to open source. Not only is IBM contributing to Apache Spark, IBM also creates new open source projects on top of it. The most prominent example is Apache SystemML which is also covered in this book. The next three years are dedicated to DeepLearning and AI. And IBM’s open source contributions will help the Apache Spark community to succeed. The most prominent example is PowerAI where IBM outperformed all state-of-the-art deep learning technologies for image recognition.
For someone just starting out in the field of big data and analytics, what would your advice be?
I suggest taking a Machine Learning course of one of the leading online training vendors. Then take a Spark course (or read my book). Finally, try to do everything yourself. Participate in Kaggle competitions and try to replicate papers.