How Facebook uses HyperLogLog in Presto to speed up cardinality estimation

2 min read

Yesterday, Facebook shared how they use HyperLogLog (HLL) in Presto to do computational intensive operations like estimating distinct values within a huge dataset. With this implementation, they were able to achieve up to 1,000x speed improvements while working on count-distinct problems.

What is HyperLogLog?

HyperLogLog is an algorithm designed for estimating the number of unique values within a huge dataset, which is also known as cardinality. To produce an estimate of the cardinality it uses an auxiliary memory of m units and performs a single pass over the data. This algorithm is an improvised version of the previously known cardinality estimator, LogLog.

Facebook uses HypeLogLog in scenarios like determining the number of distinct people visiting Facebook in the past week using a single machine. To even further speed up these type of queries they implemented HLL in Presto, which is an open source distributed SQL query engine. Presto is designed for running interactive analytic queries against data sources of all sizes.

Using HLL, the same calculation can be performed in 12 hours with less than 1 MB of memory. Facebook highlights that they have seen great improvements, with some queries being run within minutes, including those used to analyze thousands of A/B tests.

Presto’s HLL implementation

The implementation of HLL data structures in Presto consists of two layout formats: sparse and dense. To save on memory the storage starts off with a sparse layout and when the input data structure goes over the prespecified memory limit for the sparse format, Presto switches to the dense layout automatically. The sparse layout is used to get almost the exact count in low-cardinality datasets, for instance, the number of distinct countries. The dense layout is used in the cases where the cardinality is high such as the number of distinct users.

There is an HYPERLOGLOG data type in Presto. Those users who prefer a single format so that they can process the output structure in other platforms such as Python, there is another data type called P4HYPERLOGLOG, which starts and stays strictly as a dense HLL.

To read more in detail about how Facebook uses HLL, check out their article.