Databricks has announced Databricks Runtime 4.2, with numerous updates and new components covering Spark internals and Databricks Delta, along with improvements over the previous version.
Databricks Runtime 4.2 is powered by Apache Spark 2.3, and Databricks recommends adopting it quickly to prepare for the upcoming GA release of Databricks Delta.
Databricks Runtime is a set of software artifacts that run on clusters of machines and improve the usability and performance of big data analytics.
New Features of Databricks Runtime 4.2
- Added multi-cluster write support, so users can use Databricks Delta's transactional writes from more than one cluster.
- Streams can now be written directly to a Databricks Delta table registered in the Hive metastore, using df.writeStream.table(…).
- Added a new streaming foreachBatch() for Scala, which lets you define a function that processes the output of every micro-batch using DataFrame operations.
- Added support for streaming foreach() in Python; it was previously available only in Scala.
- Added from_avro/to_avro functions for reading and writing Avro data within a DataFrame.
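
To make the Python foreach() item more concrete, here is a minimal sketch of a writer object with the open/process/close methods that foreach() accepts. The class name ListWriter, the in-memory list, and the run_locally helper are hypothetical and exist only for illustration; the helper imitates the call order in plain Python so the sketch runs without a cluster, and it is not how Spark itself is implemented.

```python
# Sketch of a row-level sink for Structured Streaming's foreach() in Python.
# ListWriter and run_locally are hypothetical names used for illustration only.


class ListWriter:
    """Collects rows in memory; a real sink would write to an external system."""

    def __init__(self):
        self.rows = []
        self.closed_with = "not closed"

    def open(self, partition_id, epoch_id):
        # Called once per partition and epoch; return True to process its rows.
        self.partition_id = partition_id
        self.epoch_id = epoch_id
        return True

    def process(self, row):
        # Called once for every row in the partition.
        self.rows.append(row)

    def close(self, error):
        # Called when the partition is finished; error is None on success.
        self.closed_with = error


def run_locally(writer, rows, partition_id=0, epoch_id=0):
    """Imitates, in plain Python, the order in which the methods are called."""
    error = None
    try:
        if writer.open(partition_id, epoch_id):
            for row in rows:
                writer.process(row)
    except Exception as exc:
        error = exc
    finally:
        writer.close(error)
    return writer


writer = run_locally(ListWriter(), [{"id": 1}, {"id": 2}])
print(writer.rows)

# With a streaming DataFrame `df` (assumed, not defined here), the same kind of
# object would be attached with: df.writeStream.foreach(ListWriter()).start()
# Spark copies the writer to the executors, so an in-memory list like this one
# is only useful for local illustration.
```

Returning True from open tells the caller to process the partition, and close receives the error (or None) so that a real sink can release connections or roll back.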
Improvements
- All commands and queries of Databricks Delta support referring to a table using its path as an identifier (that is, delta.`/path/to/table`).
- DESCRIBE HISTORY includes commit ID and is now ordered newest to oldest by default.
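
As a small illustration of the path-based identifier, the sketch below builds the SQL text for DESCRIBE HISTORY against a table path. The helper names are hypothetical, and the SparkSession named spark mentioned in the final comment is an assumption about the environment, not something defined here.

```python
# Sketch: building SQL text that refers to a Databricks Delta table by path.
# Both helper functions are hypothetical and only format strings.


def delta_path_identifier(path):
    """Returns a delta.`<path>` identifier for use in SQL text."""
    if "`" in path:
        # Escaping backticks is out of scope for this sketch.
        raise ValueError("backticks in the path are not supported here")
    return "delta.`{}`".format(path)


def describe_history_sql(path):
    """Returns a DESCRIBE HISTORY statement for the table at the given path."""
    return "DESCRIBE HISTORY " + delta_path_identifier(path)


query = describe_history_sql("/path/to/table")
print(query)

# On a cluster with a SparkSession named `spark` (assumed), the statement
# could then be run with spark.sql(query).
```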
Bug Fixes
- Partition-based filtering predicates now operate correctly even when the case of the predicates differs from that of the table.
- Fixed a missing-column AnalysisException raised when performing equality checks on boolean columns in Databricks Delta tables (that is, booleanValue = true).
- CREATE TABLE no longer modifies the transaction log when it is used to create a pointer to an existing table. This prevents unnecessary conflicts with concurrent streams and allows a metastore pointer to be created for tables where the user has only read access to the data.
- Calling display() on a stream with large amounts of data no longer causes an out-of-memory (OOM) error in the driver.
- Long lineages are now truncated; they previously caused a StackOverflowError while updating the state of a Databricks Delta table.
For more details, please read the release notes officially documented by Databricks.