Data

How Facebook uses HyperLogLog in Presto to speed up cardinality estimation

2 min read

Yesterday, Facebook shared how they use HyperLogLog (HLL) in Presto to do computational intensive operations like estimating distinct values within a huge dataset. With this implementation, they were able to achieve up to 1,000x speed improvements while working on count-distinct problems.

What is HyperLogLog?

HyperLogLog is an algorithm designed for estimating the number of unique values within a huge dataset, which is also known as cardinality. To produce an estimate of the cardinality it uses an auxiliary memory of m units and performs a single pass over the data. This algorithm is an improvised version of the previously known cardinality estimator, LogLog.

Facebook uses HypeLogLog in scenarios like determining the number of distinct people visiting Facebook in the past week using a single machine. To even further speed up these type of queries they implemented HLL in Presto, which is an open source distributed SQL query engine. Presto is designed for running interactive analytic queries against data sources of all sizes.

Using HLL, the same calculation can be performed in 12 hours with less than 1 MB of memory. Facebook highlights that they have seen great improvements, with some queries being run within minutes, including those used to analyze thousands of A/B tests.

Presto’s HLL implementation

The implementation of HLL data structures in Presto consists of two layout formats: sparse and dense. To save on memory the storage starts off with a sparse layout and when the input data structure goes over the prespecified memory limit for the sparse format, Presto switches to the dense layout automatically. The sparse layout is used to get almost the exact count in low-cardinality datasets, for instance, the number of distinct countries. The dense layout is used in the cases where the cardinality is high such as the number of distinct users.

There is an HYPERLOGLOG data type in Presto. Those users who prefer a single format so that they can process the output structure in other platforms such as Python, there is another data type called P4HYPERLOGLOG, which starts and stays strictly as a dense HLL.

To read more in detail about how Facebook uses HLL, check out their article.

Read Next

Facebook open-sources PyText, PyTorch based NLP modeling framework

Facebook contributes to MLPerf and open sources Mask R-CNN2Go, its CV framework for embedded and mobile devices

Australia’s ACCC publishes preliminary report recommending Google Facebook be regulated and monitored for discriminatory and anti-competitive behavior

Bhagyashree R

Share
Published by
Bhagyashree R

Recent Posts

Top life hacks for prepping for your IT certification exam

I remember deciding to pursue my first IT certification, the CompTIA A+. I had signed…

3 years ago

Learn Transformers for Natural Language Processing with Denis Rothman

Key takeaways The transformer architecture has proved to be revolutionary in outperforming the classical RNN…

3 years ago

Learning Essential Linux Commands for Navigating the Shell Effectively

Once we learn how to deploy an Ubuntu server, how to manage users, and how…

3 years ago

Clean Coding in Python with Mariano Anaya

Key-takeaways:   Clean code isn’t just a nice thing to have or a luxury in software projects; it's a necessity. If we…

3 years ago

Exploring Forms in Angular – types, benefits and differences   

While developing a web application, or setting dynamic pages and meta tags we need to deal with…

3 years ago

Gain Practical Expertise with the Latest Edition of Software Architecture with C# 9 and .NET 5

Software architecture is one of the most discussed topics in the software industry today, and…

3 years ago