The Apache Hadoop team has released Apache Hadoop 3.2.0, an open source software platform for distributed storage and processing of large data sets. This version is the first in the 3.2 release line and is not yet generally available or production ready.

What’s new in Hadoop 3.2.0?

Node attributes support in YARN

This release introduces Node Attributes, which let you tag nodes with multiple labels based on their attributes. Containers can then be placed according to expressions over these labels. Node attributes are not associated with any queue, so they need no queue resource planning or authorization.
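
To make the idea of expression-based placement concrete, here is a minimal sketch in Python. It is a toy model, not YARN's API: the `Node` class, the `(key, operator, value)` constraint tuples and the attribute names such as `os` and `gpu` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster node tagged with free-form attributes (hypothetical model)."""
    name: str
    attributes: dict = field(default_factory=dict)


# Comparison operators supported by this toy model (an assumption).
OPS = {
    "=": lambda actual, wanted: actual == wanted,
    "!=": lambda actual, wanted: actual != wanted,
}


def matches(node, constraints):
    """Return True if the node satisfies every (key, op, value) constraint.

    A missing attribute is read as None, so it never equals a wanted value.
    """
    return all(
        OPS[op](node.attributes.get(key), value)
        for key, op, value in constraints
    )


def candidate_nodes(nodes, constraints):
    """Names of the nodes on which a container could be placed."""
    return [node.name for node in nodes if matches(node, constraints)]


nodes = [
    Node("node1", {"os": "centos7", "gpu": "true"}),
    Node("node2", {"os": "ubuntu", "gpu": "false"}),
    Node("node3", {"os": "centos7", "gpu": "false"}),
]

# Only node1 runs centos7 and has a GPU.
print(candidate_nodes(nodes, [("os", "=", "centos7"), ("gpu", "=", "true")]))
```

Note that nothing in the sketch refers to a queue: the match depends only on the attributes attached to each node.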

Hadoop Submarine on YARN

This release comes with Hadoop Submarine, which enables data engineers to develop, train and deploy deep learning models in TensorFlow on the same Hadoop YARN cluster where the data resides. It also allows jobs to access data and models in HDFS (Hadoop Distributed File System) and other storage systems. It supports user-specified Docker images and customized DNS names for roles, such as tensorboard.$user.$domain:6006.
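
As a rough illustration of how a role name pattern like the one above expands, the sketch below fills in the placeholders with Python's standard `string.Template`. It only demonstrates the substitution. The function name and the sample user and domain are hypothetical, and this is not how Submarine itself resolves names.

```python
from string import Template

# The pattern quoted in the article; $user and $domain are placeholders.
ROLE_ENDPOINT = Template("tensorboard.$user.$domain:6006")


def tensorboard_endpoint(user, domain):
    """Expand the pattern for one user and DNS domain (illustrative helper)."""
    return ROLE_ENDPOINT.substitute(user=user, domain=domain)


# Sample values are made up for the example.
print(tensorboard_endpoint("alice", "example.com"))
```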

Storage policy satisfier

Storage policy satisfier lets HDFS applications move blocks between storage types as they set storage policies on files and directories. It is also a solution for decoupling storage capacity from compute capacity.
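
The sketch below models what "satisfying" a policy means: given the storage types currently holding a block's replicas, it works out which replicas would have to move. The policy layouts follow the standard HDFS storage policies (HOT, WARM, COLD, ONE_SSD, ALL_SSD), but the functions are hypothetical and simplified, and they are not the satisfier's actual implementation.

```python
from collections import Counter


def desired_types(policy, replication):
    """Storage types a block's replicas should occupy under a policy."""
    layouts = {
        "HOT": ["DISK"] * replication,
        "WARM": ["DISK"] + ["ARCHIVE"] * (replication - 1),
        "COLD": ["ARCHIVE"] * replication,
        "ONE_SSD": ["SSD"] + ["DISK"] * (replication - 1),
        "ALL_SSD": ["SSD"] * replication,
    }
    return layouts[policy]


def plan_moves(current, policy):
    """Return (from_type, to_type) moves that bring replicas in line.

    Assumes at least one replica. Replicas already on a wanted storage
    type stay where they are.
    """
    wanted = Counter(desired_types(policy, len(current)))
    have = Counter(current)
    surplus = list((have - wanted).elements())  # replicas on the wrong type
    missing = list((wanted - have).elements())  # types still to be filled
    return list(zip(surplus, missing))


# A file stored as HOT (three DISK replicas) is switched to WARM:
# one replica stays on DISK and two move to ARCHIVE.
print(plan_moves(["DISK", "DISK", "DISK"], "WARM"))
```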

Enhanced S3A connector

This release includes an enhanced S3A connector with better resilience to throttled AWS S3 and DynamoDB I/O.
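
A common way for a client to cope with throttling is to retry with exponential backoff and jitter. The sketch below shows that generic technique in Python. It is not the S3A connector's code, and `ThrottledError`, `call_with_backoff` and all the parameter defaults are assumptions chosen for the example.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a 'slow down' response from a remote store (hypothetical)."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying with jittered backoff when throttled."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt, capped at max_delay.
            # The actual wait is a random value below that ceiling (jitter).
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Simulated read that is throttled twice before it succeeds.
calls = {"count": 0}


def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ThrottledError("throttled")
    return "data"


# sleep is stubbed out so the example finishes immediately.
result = call_with_backoff(flaky_read, sleep=lambda seconds: None)
print(result, calls["count"])
```

The random jitter spreads retries out so that many throttled clients do not all retry at the same moment.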

ABFS filesystem connector

The ABFS filesystem connector supports the latest Azure Data Lake Gen2 storage.

Major improvements

  • The jdk1.7 profile has been removed from the hadoop-annotations module.
  • Redundant logging related to tags has been removed from configuration.
  • The ADLS connector has been updated to use the current SDK version (2.2.7).
  • This release includes LocalizedResource size information in the NM download log for localization.
  • This version of Apache Hadoop comes with the ability to configure auxiliary services from HDFS-based JAR files.
  • This release comes with the ability to specify user environment variables individually.
  • The debug messages in MetricsConfig.java have been improved.
  • Capacity scheduler performance metrics have been added.
  • This release comes with added support for node labels in opportunistic scheduling.

Major bug fixes

  • The issue with logging for split-dns multihome has been resolved.
  • The snapshotted encryption zone information in this release is immutable.
  • A shutdown routine has been added in HadoopExecutor to ensure a clean shutdown.
  • Registry entries are now deleted from ZK on ServiceClient.
  • The javadoc of package-info.java has been improved.
  • An NPE in AbstractSchedulerPlanFollower has been fixed.

To learn more about this release, check out the release notes on Hadoop’s official website.
