Last week, the Apache Spark project released its latest version, Apache Spark 2.4.0. It is the fifth release in the 2.x line.

This release adds Barrier Execution Mode for better integration with deep learning frameworks, and brings 30+ built-in and higher-order functions for working with complex data types. It also adds support for Scala 2.12 and improves the K8s (Kubernetes) integration. The release focuses on usability, stability, and polish, resolving around 1100 tickets.

What’s new in Apache Spark 2.4.0?

  • Built-in Avro data source
  • Image data source
  • Flexible streaming sinks
  • Elimination of the 2GB block size limitation during transfer
  • Pandas UDF improvements

Major changes

  • Apache Spark 2.4.0 supports Barrier Execution Mode in the scheduler, for better integration with deep learning frameworks.
  • One can now build Spark with Scala 2.12 and write Spark applications in Scala 2.12.
  • Apache Spark 2.4.0 includes the spark-avro package as a built-in Avro data source, with logical type support, for better performance and usability.
  • Some users are SQL experts but are less familiar with Scala, Python, or R. For them, this version of Apache Spark adds support for the PIVOT clause in SQL.
  • Apache Spark 2.4.0 adds Structured Streaming ForeachWriter support for Python. Users can now write ForeachWriter code in Python and use the partitionId and the version/batchId/epochId to conditionally process rows.
  • This release also introduces a Spark data source for the image format. Users can now load images through the Spark source reader interface.
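
To make the ForeachWriter item above more concrete, here is a minimal sketch of a Python writer object with the `open`, `process`, and `close` methods that Structured Streaming's `foreach` sink calls. The class name `DedupingRowWriter` and the in-memory `FAKE_STORE` dictionary are hypothetical stand-ins invented for this illustration. A real job would record committed batches in the external sink itself, because Spark serializes the writer and runs copies of it on the executors.

```python
# Hypothetical stand-in for an external store, keyed by (partition_id, epoch_id).
# Assumption for illustration only: a real sink would be a database or similar.
FAKE_STORE = {}


class DedupingRowWriter:
    """Foreach-style writer that skips partitions/epochs already written."""

    def __init__(self):
        self.key = None
        self.accepted = False
        self.buffer = []

    def open(self, partition_id, epoch_id):
        # Called once per partition and epoch. Returning False tells Spark
        # not to call process() for the rows of this partition/epoch.
        self.key = (partition_id, epoch_id)
        self.buffer = []
        self.accepted = self.key not in FAKE_STORE
        return self.accepted

    def process(self, row):
        # Called for every row when open() returned True.
        self.buffer.append(row)

    def close(self, error):
        # Called after the rows are processed; error is None on success.
        # Commit only if this writer accepted the batch and nothing failed.
        if self.accepted and error is None:
            FAKE_STORE[self.key] = list(self.buffer)
```

With a streaming DataFrame (called `streaming_df` here purely as a placeholder), such an object would be attached with `streaming_df.writeStream.foreach(DedupingRowWriter()).start()`. Buffering rows and committing them in `close` is one possible design; it lets a replayed partition and epoch be skipped instead of being written twice.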

Bug fixes

  • Previously, the LookupFunctions rule checked the same function name again and again, performing a check for each invocation. In this version, the rule checks each function name only once.
  • A PageRank change in Apache Spark 2.3 introduced a bug in the ParallelPersonalizedPageRank implementation: the change prevented serialization of a Map that needs to be broadcast to all workers. This issue has been resolved in Apache Spark 2.4.0.

Read more about Apache Spark 2.4.0 on the official website of Apache Spark.
