Virtually smaller than Bloom and Xor
What the research is:
The ribbon filter is a new data structure that is more space efficient than the popular bloom filters that are often used to optimize data retrieval. Bloom and now Ribbon filters solve real technical problems, among other things through a smooth configurability that cannot be achieved by other filters. Bloom filters work by overestimating a set of keys associated with a data resource. With a bloom filter, almost all negative queries to this resource can be skipped (filtered), since the bloom filter rejects almost all query keys that are not associated with the resource.
With the right data layout and design, the ribbon filter is the first Bloom alternative that achieves the almost continuous, safe compromise between accuracy and space requirements of Bloom filters.
Almost continuously here means that every amount of memory is used efficiently to represent any number of keys, so that wasted memory, such as with internal fragmentation, can be minimized to zero. The typical danger for the compromise between accuracy and space is bit alignment, in which some data sizes (e.g. 4, 8 or 16 bits per key) can be accessed more quickly than others. Like Bloom, our data layout for Ribbon is not subject to this risk in terms of access times and is therefore more freely configurable. And ribbon filters give the space-time compromise new freedom of configurability.
Building on some previous lines of research, the ribbon filter combines a simplified, faster, and more flexible design algorithm; a data layout optimized for filter queries; and near continuous configurability to provide a practical alternative to static (immutable) Bloom filters.
Although sophisticated bloom filters are extremely fast, they require around 50 percent more space (overhead) than the information-theoretical lower limit for filters on any key. If bloom filters cannot meet the space efficiency goals of an application, ribbon filter variants dominate in the case of space-time compromises with almost continuous configurability and space requirements of only 1 percent or less. Ribbon filters have O (1) query times and save about 1/3 of the memory compared to Bloom filters.
How it works:
Like some related immutable structures used for perfect hashing and maps, ribbon filters are constructed by solving a linear system given by hash functions applied to a set of keys. Each line in the linear system expresses that the query, as a key that XORs the values at a particular set of array indexes, must result in a prescribed value to indicate that it is “in” the set of keys is.
Despite the use of Boolean GF (2) arithmetic, the approach to solving this system of logical equations is to use Gaussian Elimination, which essentially means subtracting equations from one another until you can isolate variables (unknowns). If a solution exists, this approach will find it.
The name ribbon has a double meaning. First we use a linear system from Dietzfelbinger and Walzer, whose sorted coefficient matrix resembles a physical band or a wave-like approximation of a band matrix. Gaussian elimination is fundamentally more efficient on this system, since it already comes close to a reduced form.
Ribbon also stands for Rapid Incremental Boolean Banding ON the fly, this is the name of our faster and more flexible new Gaussian solver. Using an approach similar to pasting it into a linear hash table, Ribbon performs Gaussian elimination on the fly. This saves time and space during construction, since line reductions can be carried out in registers instead of in memory and because the reduced form of the band coefficient matrix – a band matrix – is more space-saving than the explicit representation of the band shape. The on-the-fly design also opens up a solution to the core challenge of the ribbon approach: scaling its space efficiency to a very large number of keys.
Why it matters:
On a Facebook scale, we expect ribbon filters to save several percent of RAM resources, with a tiny increase in CPU usage for some large storage systems. However, we do not implement efficiency gains at every engineering cost, so a user-friendly data structure is also important. This problem has blocked the implementation of other Bloom alternatives that offer some space savings.
The ribbon filter opens up these new compromises without introducing significant discontinuities or dangers in the configuration space. In other words, making ribbon filters generic and highly configurable is a bit complicated, but those details can be hidden behind a relatively simple API. You are essentially free to choose between three of the four core performance dimensions – number of buttons added to the set, memory usage, CPU efficiency and accuracy – and the accuracy is automatically optimized.
Read the full paper:
Band filter: Practically smaller than Bloom and Xor
Comments are closed.