Uber announced the details of its new open source real-time analytics engine, AresDB, yesterday. AresDB, released in November 2018, is Uber’s solution for unifying, simplifying, and improving its real-time analytics database solutions. It uses an unconventional power source, graphics processing units (GPUs), to help analytics grow at scale.
AresDB’s architecture includes features such as column-based storage with compression (for storage efficiency), real-time ingestion with upsert support (for high data accuracy), and GPU-powered query processing (for highly parallelized data processing).
Let’s have a look at these key features of AresDB.
Column-based storage
AresDB stores data in a columnar format. The values in each column are stored as a columnar value vector. The nullness/validity of the values in each column is stored in a separate null vector, where the validity of each value is represented by one bit. Column-based storage takes place in two types of stores: the live store and the archive store.
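The value vector and one-bit-per-value null vector described above can be sketched roughly as follows. This is a minimal illustration in Python, not AresDB's actual implementation; the class name `ColumnVector` and its methods are hypothetical.

```python
class ColumnVector:
    """Hypothetical sketch: one column as a value vector plus a null vector."""

    def __init__(self, size, default=0):
        self.values = [default] * size
        # One validity bit per value, packed eight to a byte; 0 means null.
        self.validity = bytearray((size + 7) // 8)

    def set(self, row, value):
        self.values[row] = value
        self.validity[row // 8] |= 1 << (row % 8)

    def set_null(self, row):
        # Clear the validity bit; the stale value is ignored on read.
        self.validity[row // 8] &= ~(1 << (row % 8)) & 0xFF

    def get(self, row):
        valid = (self.validity[row // 8] >> (row % 8)) & 1
        return self.values[row] if valid else None


fare = ColumnVector(size=10)
fare.set(0, 12.5)
fare.set(9, 7.0)
fare.set_null(9)
print(fare.get(0), fare.get(1), fare.get(9))
```

Ten values need only two bytes of validity information here, which is the point of keeping nullness in a separate bit vector.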
Live Store
This is where all the uncompressed and unsorted columnar data (live vectors) is stored. Data records in the live store are partitioned into (live) batches of configured capacity. As described above, the values of each column within a batch are stored as a columnar vector, with a separate one-bit-per-value null vector recording their validity.
Archive Store
AresDB also stores all the mature, sorted, and compressed columnar data (archive vectors) in an archive store with the help of fact tables, which store an infinite stream of time series events. Records in the archive store are partitioned into batches, as in the live store. However, unlike a live batch, each archive batch contains the records of one particular Coordinated Universal Time (UTC) day. Within a batch, records are sorted according to the user-configured column sort order.
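As a rough illustration of the archive layout, the sketch below groups records by UTC day and sorts each group by a configured column order. It is only a sketch: the function name `build_archive_batches`, the `event_time` field, and the sample columns are assumptions, and compression is left out.

```python
from collections import defaultdict
from datetime import datetime, timezone


def build_archive_batches(records, sort_columns, time_column="event_time"):
    """Group records into one batch per UTC day, sorted by sort_columns."""
    batches = defaultdict(list)
    for rec in records:
        day = datetime.fromtimestamp(rec[time_column], tz=timezone.utc).date()
        batches[day].append(rec)
    for day_records in batches.values():
        # Sort by the user-configured column order, compared left to right.
        day_records.sort(key=lambda r: tuple(r[c] for c in sort_columns))
    return dict(batches)


records = [
    {"event_time": 1546300800, "city_id": 2, "status": "completed"},  # 2019-01-01 00:00:00 UTC
    {"event_time": 1546387199, "city_id": 1, "status": "completed"},  # 2019-01-01 23:59:59 UTC
    {"event_time": 1546387200, "city_id": 1, "status": "canceled"},   # 2019-01-02 00:00:00 UTC
]
archive = build_archive_batches(records, sort_columns=["city_id", "status"])
```

Sorting within each day places equal values next to each other, which is what makes compression of the archive vectors effective.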
Real-time ingestion with upsert support
With real-time ingestion, clients ingest data through the ingestion HTTP API by posting an upsert batch. An upsert batch is a custom, serialized binary format that minimizes space overhead while keeping the data randomly accessible.
After AresDB receives an upsert batch for ingestion, it first appends the batch to the end of the redo log for recovery. It then identifies “late” records (those whose event time is older than the archived cut-off time) and skips them for ingestion into the live store; these records are appended to a backfill queue instead. For records that are not “late,” AresDB uses the primary key index to locate the batch within the live store, and the records are applied there.
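The ingestion path above can be sketched as follows. This is a simplified model under stated assumptions, not AresDB code: the class `Ingestor`, the `id` and `event_time` fields, and the use of plain Python lists and dicts in place of the binary upsert batch and columnar live batches are all hypothetical.

```python
class Ingestor:
    """Hypothetical sketch of redo log, backfill queue, and upsert handling."""

    def __init__(self, archive_cutoff):
        self.archive_cutoff = archive_cutoff
        self.redo_log = []           # every batch is logged first, for recovery
        self.backfill_queue = []     # "late" records wait here
        self.live_store = []         # stand-in for live batches
        self.primary_key_index = {}  # primary key -> position in live_store

    def ingest(self, upsert_batch):
        self.redo_log.append(list(upsert_batch))
        for rec in upsert_batch:
            if rec["event_time"] < self.archive_cutoff:
                # Late record: skip the live store and queue it for backfill.
                self.backfill_queue.append(rec)
                continue
            pos = self.primary_key_index.get(rec["id"])
            if pos is None:
                # New primary key: insert.
                self.primary_key_index[rec["id"]] = len(self.live_store)
                self.live_store.append(dict(rec))
            else:
                # Existing primary key: update in place (the "upsert").
                self.live_store[pos].update(rec)


ing = Ingestor(archive_cutoff=100)
ing.ingest([
    {"id": "a", "event_time": 150, "fare": 10},
    {"id": "b", "event_time": 50, "fare": 5},   # older than the cut-off
])
ing.ingest([{"id": "a", "event_time": 150, "fare": 12}])  # updates record "a"
```

After both calls the live store holds a single record for `a` with the updated fare, while `b` sits in the backfill queue.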
GPU-powered query processing
Users run queries against AresDB using Ares Query Language (AQL), which was created by Uber. AQL is a time series analytical query language expressed as JSON, YAML, or Go objects rather than the standard SQL syntax of SELECT FROM WHERE GROUP BY. For dashboard and decision system developers, AQL in JSON format offers a better programmatic query experience than SQL, because developers can manipulate queries in code without worrying about issues such as SQL injection.
AresDB manages multiple GPU devices with the help of a device manager that models GPU device resources in two dimensions: GPU threads and device memory. This helps track GPU memory consumption while queries are being processed. After query compilation, AresDB lets users estimate the amount of resources required to execute a query.
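To illustrate what manipulating a query in code looks like, here is a small sketch that builds a JSON query as a Python dictionary and then modifies it. The field names (`table`, `dimensions`, `measures`, `rowFilters`, `timeFilter`) and the `trips` table are illustrative assumptions and have not been checked against the AQL specification.

```python
import json


def build_trip_count_query(city_id, since):
    """Build a query as plain data. Field names are assumed, not verified AQL."""
    return {
        "table": "trips",
        "dimensions": [{"sqlExpression": "request_at", "timeBucketizer": "hour"}],
        "measures": [{"sqlExpression": "count(*)"}],
        # int() coerces the input, so arbitrary text cannot reach the filter.
        "rowFilters": [f"city_id = {int(city_id)}"],
        "timeFilter": {"column": "request_at", "from": since},
    }


query = build_trip_count_query(city_id=12, since="2 days ago")
# Because the query is a data structure, code can extend it directly.
query["dimensions"].append({"sqlExpression": "status"})
payload = json.dumps(query, indent=2)
```

A dashboard can add or remove dimensions and filters by editing the structure, with no string concatenation of SQL fragments.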
Device memory requirements need to be satisfied before a query is allowed to start. AresDB is currently capable of running either one or several queries on the same GPU device as long as the device is capable of satisfying all of the resource requirements.
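The admission rule above can be sketched as a simple reservation check. This is a hypothetical model that tracks only device memory and leaves out GPU threads; the class `DeviceManager` and its methods are assumptions for illustration.

```python
class DeviceManager:
    """Hypothetical sketch: admit a query only if a device has enough memory."""

    def __init__(self, device_memory_bytes):
        self.free = list(device_memory_bytes)  # free bytes per GPU device
        self.reservations = {}                 # query_id -> (device, bytes)

    def try_start(self, query_id, required_bytes):
        """Reserve memory on the first device that fits; return it or None."""
        for device, free in enumerate(self.free):
            if free >= required_bytes:
                self.free[device] -= required_bytes
                self.reservations[query_id] = (device, required_bytes)
                return device
        return None  # no device can satisfy the requirement right now

    def finish(self, query_id):
        """Release the memory reserved by a finished query."""
        device, used = self.reservations.pop(query_id)
        self.free[device] += used


manager = DeviceManager(device_memory_bytes=[1000, 500])
first = manager.try_start("q1", 600)   # fits on device 0
second = manager.try_start("q2", 300)  # shares device 0 with q1
third = manager.try_start("q3", 700)   # no device has 700 bytes free
manager.finish("q1")
retry = manager.try_start("q3", 700)   # fits on device 0 after q1 finishes
```

Here `q1` and `q2` run on the same device because their combined requirement fits, while `q3` has to wait until memory is released.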
Future work
AresDB is open sourced under the Apache License and is currently being used widely at Uber to power its real-time data analytics dashboards, helping it make data-driven decisions at scale. In the future, the Uber team wants to improve the project by adding new features. These new features include building the distributed design of AresDB to improve its scalability and reduce the overall operational costs.
Uber also wants to add developer support and tooling to help developers quickly integrate AresDB into their analytics stack. Other plans include expanding the feature set and optimizing the query engine.
For more information, check out the official Uber announcement.