In the process of rebuilding its Big Data platform, Uber created an open-source Spark library named Hadoop Upserts anD Incremental (Hudi). This library permits users to perform operations such as update, insert, and delete on existing Parquet data in Hadoop. It also allows data users to incrementally pull only the changed data, which significantly improves query efficiency. It is horizontally scalable, can be used from any Spark job, and the best part is that it only relies on HDFS to operate.
Why is Hudi introduced?
Uber studied its current data content, data access patterns, and user-specific requirements to identify problem areas. This research revealed the following four limitations:
Scalability limitation in HDFS
Many companies who use HDFS to scale their Big Data infrastructure face this issue. Storing large numbers of small files can affect the performance significantly as HDFS is bottlenecked by its NameNode capacity. This becomes a major issue when the data size grows above 50-100 petabytes.
Need for faster data delivery in Hadoop
Since Uber operates in real time, there was a need for providing services the latest data. It was important to make the data delivery much faster, as the 24-hour data latency was way too slow for many of their use cases.
No direct support for updates and deletes for existing data
Uber used snapshot-based ingestion of data, which means a fresh copy of source data was ingested every 24 hours. As Uber requires the latest data for its business, there was a need for a solution which supports update and delete operations for existing data. However, since their Big Data is stored in HDFS and Parquet, direct support for update operations on existing data is not available.
Faster ETL and modeling
ETL and modeling jobs were also snapshot-based, requiring their platform to rebuild derived tables in every run. ETL jobs also needed to become incremental to reduce data latency.
How Hudi solves the aforementioned limitations?
The following diagram shows Uber’s Big Data platform after the incorporation of Hudi:
Regardless of whether the data updates are new records added to recent date partitions or updates to older data, Hudi allows users to pass on their latest checkpoint timestamp and retrieve all the records that have been updated since. This data retrieval happens without running an expensive query that scans the entire source table.
Using this library Uber has moved to an incremental ingestion model leaving behind the snapshot-based ingestion. As a result, the data latency was reduced from 24 hrs to less than one hour.
To know about Hudi in detail, check out Uber’s official announcement.