ELI5: Zstandard – Smaller and faster data compression
In this blog post, we explain Zstandard (ZSTD), a fast data compression algorithm with best-in-class performance, in a way that is very easy to understand. If you prefer to learn by watching or listening, check out the video about this open source project on our Facebook Open Source YouTube channel.
Zstandard (ZSTD) is a fast, lossless compression algorithm. It offers high compression ratios as well as fast compression and decompression, giving it the best overall performance in many common situations. In addition, ZSTD now has a number of features that make practical many real-world scenarios that compressors previously found difficult to handle.
There are three standard metrics for comparing compression algorithms and implementations:
- Compression ratio: The original size divided by the compressed size (higher is better).
- Compression speed: How quickly the data can be reduced in size, measured in MB/s.
- Decompression speed: How quickly the original data can be reconstructed from the compressed data, measured in MB/s.
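To make the three metrics concrete, here is a minimal sketch that measures them for a single round trip. It uses Python's built-in zlib module as a stand-in because it ships with the standard library; it is not ZSTD, and the `measure` helper and the sample data are assumptions made up for illustration.

```python
import time
import zlib


def measure(data: bytes, level: int = 6) -> dict:
    """Return the compression ratio and both speeds (MB/s) for one round trip."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    decompress_seconds = time.perf_counter() - start

    # Lossless: the round trip must give back exactly the original bytes.
    assert restored == data

    megabytes = len(data) / 1_000_000
    return {
        # Original size divided by compressed size.
        "ratio": len(data) / len(compressed),
        # Speeds are relative to the uncompressed size; guard against a zero timer reading.
        "compress_mb_s": megabytes / max(compress_seconds, 1e-9),
        "decompress_mb_s": megabytes / max(decompress_seconds, 1e-9),
    }


# Hypothetical sample data: a highly repetitive text.
sample = b"the quick brown fox jumps over the lazy dog\n" * 20_000
result = measure(sample)
print(result)
```

The absolute numbers depend on the machine and on the data, so a single run like this is only a rough indication, not a benchmark.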
Many of the algorithms in use today focus on one of the above metrics or try to find a balance between them. Several fast compression algorithms have been tested and compared, and as shown in the figure below, there are often drastic tradeoffs between speed and size (source).
The fastest algorithm, lz4 1.9.2, achieves a lower compression ratio; the one with the highest compression ratio (other than ZSTD), zlib 1.2.11-1, suffers from slow compression speed. ZSTD, however, shows significant improvements in both compression speed and decompression speed while maintaining a high compression ratio. Note that the negative compression levels, specified with --fast=X, provide faster compression and decompression in exchange for some loss in compression ratio compared to level 1.
As shown in the table below, ZSTD offers a very wide range of speed/compression tradeoffs, so you can trade compression speed for a better compression ratio, and vice versa. ZSTD can provide these speeds because it is backed by an extremely fast decoder (source).
However, most of these results apply to typical file and stream scenarios, where the data is several MB in size. Smaller data is handled a little differently.
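The idea of trading speed for size through a compression level can be sketched as follows. This again uses the standard-library zlib module (levels 1 to 9) purely as an illustration of the concept; it is not ZSTD, and the `compare_levels` helper and the generated sample are hypothetical.

```python
import random
import time
import zlib


def compare_levels(data: bytes, levels=(1, 6, 9)) -> dict:
    """Map each level to (compressed size in bytes, seconds spent compressing)."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Every level must still decompress to the original data.
        assert zlib.decompress(compressed) == data
        results[level] = (len(compressed), elapsed)
    return results


# Hypothetical sample: pseudo-random "text" built from a small vocabulary.
rng = random.Random(0)
words = [b"zstd", b"fast", b"ratio", b"speed", b"level", b"data", b"block"]
sample = b" ".join(rng.choice(words) for _ in range(200_000))

level_results = compare_levels(sample)
for level, (size, seconds) in level_results.items():
    print(f"level {level}: {size} bytes in {seconds:.4f} s")
```

Higher levels typically spend more time searching for matches and produce smaller output, but how large the difference is depends heavily on the data.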
In general, the smaller the amount of data to be compressed, the more difficult it is to compress. Compression algorithms learn from past data how to compress future data. At the start of a new data set, there is no past data to build on, which makes compression harder. To solve this problem, ZSTD offers a special training mode that tunes the algorithm for a selected type of data. The result of this training is a dictionary that captures the common patterns in the data. This dictionary must be loaded before both compression and decompression. With the patterns captured in the dictionary, the compressor can assume that future data will look similar and compress efficiently right from the start. Using such a dictionary dramatically improves the compression ratio for small data, as shown in the following graphic (source).
The type of data being compressed can also affect these metrics. Many algorithms are tailored to specific data types, such as English text, genetic sequences, or rasterized images. ZSTD, however, is intended for general-purpose compression of a wide variety of data types.
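The effect of a dictionary on small data can be sketched with a few lines of Python. Two assumptions apply: the example uses the preset-dictionary support (`zdict`) of the standard-library zlib module rather than ZSTD itself, and the dictionary is written by hand instead of being produced by a training step. The record layout and helper names are made up for illustration.

```python
import zlib

# Hand-written dictionary holding the patterns we expect in every record.
# (ZSTD would generate its dictionary by training on many sample records.)
dictionary = b'{"user_id": , "status": "active", "country": "US", "plan": "free"}'

# One small record, similar to the patterns stored in the dictionary.
message = b'{"user_id": 48213, "status": "active", "country": "US", "plan": "free"}'


def compress_plain(data: bytes) -> bytes:
    compressor = zlib.compressobj(level=9)
    return compressor.compress(data) + compressor.flush()


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    # The same dictionary has to be loaded again for decompression.
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()


plain = compress_plain(message)
with_dict = compress_with_dict(message, dictionary)

print("original:          ", len(message), "bytes")
print("without dictionary:", len(plain), "bytes")
print("with dictionary:   ", len(with_dict), "bytes")

# The round trip only works when the decompressor is given the same dictionary.
assert decompress_with_dict(with_dict, dictionary) == message
```

Without the dictionary, a record this small has almost no repetition for the compressor to exploit; with it, most of the record can be expressed as references to the dictionary.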
Where is ZSTD used?
ZSTD was open sourced in 2016 and is used continuously at Facebook as a powerful and flexible compression engine, compressing large amounts of data in multiple formats across development servers, data warehouses, databases, and compressed file systems. To better understand where ZSTD is being used, check out this Facebook engineering blog post that explains how Facebook improved compression at scale with ZSTD.
ZSTD is used by Linux, FreeBSD, Amazon Web Services, and many more. For a detailed list of the industries in which ZSTD is used, please visit the project's website.
Where can I find out more?
ZSTD has an extensive collection of APIs and supports a number of popular programming languages. To learn more about ZSTD, visit the project's website, which has great information on the benchmarks and the supported languages. To learn more about using the algorithm, build instructions, and testing, visit the project's GitHub page. A detailed API reference can be found in the documentation.
If you have any further questions about ZSTD, please let us know on our YouTube channel or on Twitter. We always want to hear from you, and we hope you find this open source project and the ELI5 series useful.
About the ELI5 series
In a series of short videos, one of our developer advocates on the Facebook Open Source team explains a Facebook open source project in a way that is easy to understand and use.
For each of these videos, which you can find on our YouTube channel, we write an accompanying blog post (like the one you are reading now).
To learn more about Facebook Open Source, visit our open source site, subscribe to our YouTube channel, or follow us on Twitter and Facebook.
Interested in working with open source at Facebook? Check out our open source-related job postings on our careers page by taking this short survey.