On Thursday, the team at Facebook open sourced F14, an algorithm for faster and memory-efficient hash tables. F14 helps the hash tables provide a faster way for maintaining a set of keys or map keys to values, even if the keys are objects, like strings.
The team at Facebook aimed at simplifying the process of selecting the right hash table with the help of F14. The algorithm F14 focuses on the 14-way probing hash table within Folly, Facebook’s open source library of C++ components. These F14 hash tables are better in performance as compared to the previous tools the team had. According to Facebook F14 is can be used as default for all sorts of use cases.
There are many factors that need to be considered while choosing a C++ hash table. These factors could be about keeping long-lived references or pointers to the entries, the size of the keys, the size of the tables, etc. The team at Facebook suggests if the developers aren’t planning to keep long-lived references to entries then they may start with the option- folly::F14FastMap/Set or, they can opt for folly::F14NodeMap/Set.
Challenges with the existing hash table algorithms
Usually the hash tables start by computing a numeric hash code for each key and uses that number for indexing into an array. The hash code for a key remains the same but the hash codes for different keys are different. These keys in a hash table are distributed randomly across the slots in the array. So there are chances of collisions between the keys that map to the same array.
Using Chaining method
Most of the hash table algorithms handle these collisions by using the chaining method. The chaining method uses a secondary data structure such as a linked list for storing all the keys for a slot. This helps in storing the keys directly in the main array and further checking new slots if there is a collision. If the number of keys are divided by the size of the main array, we get a number called the load factor which the measure of the hash table’s fullness. By decreasing the load factor and making the main array larger, it’s possible to reduce the number of collisions. The problem with this method is that it will waste memory.
Using STL method
The next method is using the standard template library (STL) for C++ that provides hash tables via std::unordered_map and std::unordered_set. The method does guarantee reference stability which means the references and pointers to the keys and values in the hash table remains valid until the corresponding key is removed. Thus the entries must be indirect and individually allocated but that would add a substantial CPU overhead.
Folly comes with a fast C++ class without reference stability and a slower C++ class that allocates each entry in a separate node. According to Facebook, the node-based version is not fully standard compliant, but it is still compatible with the standard version for all the codes.
F14 reduces collisions with vector instructions
F14 provide practical improvements for both performance and memory by using the vector instructions available on modern CPUs. F14 uses the hash code for mapping the keys to a block of slots instead of to a single slot and then searches within the chunk in parallel. For this intra-chunk search, F14 uses vector instructions (SSE2 or NEON) that filters all the slots of the chunk at the same time.
The team at Facebook has named this algorithm as F14 because it filters 14 slots at once. This F14 algorithm performs collision resolution in case a chunk overflows or if two keys both pass the filtering step. With F14 there is a low probability of collision taking place within instruction pipelining. The team used Chunking strategy for lowering the collision rate. To explain this better, the chance that 15 of the table’s keys would map to a chunk with 14 slots is quite lower than the chance that two keys would map to one slot.
For instance, imagine you are in a room with 180 people. The chance that one other person has the same birthday as you is about 50 percent, but the chance that there are 14 people who were born in the same fortnight as you is much lower than 1 percent. Chunking keeps the collision rate low even for load factors above 80 percent. Even if there were 300 people in the room, the chance of a fortnight “overflow” is still less than 5 percent.
F14 uses reference-counted tombstones for empty slots
Most of the strategies for reducing the collisions keep looking for an empty slot until they find one but that is a little difficult to execute. In this case, the algorithm should either leave a tombstone, an empty slot that doesn’t terminate the probe search or it has to slide down the later keys in the probe sequence. This is again very complex and difficult to execute. In workloads that mix insert and erase, the tombstones can get accumulated. And accumulated tombstones increase the load factor from the performance perspective.
F14 uses a strategy that acts similar to reference-counted tombstones. This strategy is based on an auxiliary bit for each slot suggested by Amble and Knuth in their 1974 article “Ordered hash tables.” A bit is set whenever their insertion routine passes a slot that is already full and the bit records that a slot has overflowed. A tombstone corresponds to an empty slot with the overflow bit set. The overflow bit makes searches faster as the search for a key can be stopped at a full slot where the overflow bit is clear, even if the following slot is not empty.
The team at Facebook has worked towards having the count the number of active overflows. The overflow bits are set when a displaced key is inserted rather than when the key that did the displacing is removed. This makes it easy to keep track of the number of keys relying on an overflow bit. Each of the F14 chunks uses 1 byte of metadata for counting the number of keys that wanted to be placed in the chunk but are currently stored in a different chunk. And when a key gets erased, it decrements the overflow counter on all the chunks that are on its probe sequence by cleaning them up.
F14 optimizes memory effectively
It is important to reduce memory waste for improving performance and allowing more of a program’s data to fit in the cache. The two common strategies used for hash table memory layouts are indirect which uses a pointer stored in the main array and direct which uses a memory of the keys and values incorporated directly into the main hash array.
F14 uses pointer-indirect (F14Node) and direct storage (F14Value) versions, and index-indirect (F14Vector). The Facebook team uses STL container std::unordered_set that never wastes any data space as it waits until the last moment to allocate nodes. The F14NodeSet executes a separate memory allocation for every value, like std::unordered_set. It also stores pointers to the values in chunks and uses F14’s probing collision resolution to ensure that there are no chaining pointers and no per-insert metadata. The F14ValueSet stores the values inline and has lesser data waste due to using a higher maximum load factor. F14ValueSet hence achieves memory-efficiency easily.
While testing, the team encountered unit test failures and production crashes. Though the team has randomized the code for debug builds. F14 now randomly chooses among all the empty slots in a chunk when inserting a new key. Also, the entry order isn’t completely randomized and according to the team, the shuffle is good enough to catch a regressing test with only a few runs.
To know more about this news, check out Facebook’s post.