lz4_flex 0.11: Gainzzzzz Unleashed!
lz4_flex is an LZ4 compression library written in Rust.
lz4_flex 0.11 has been unleashed, and since the sole purpose of this blog is memes, I’ll summarize the performance gains as: gainnzzzz.
I mean, look at all the z’s in gainnzzzz. It has to be really fast.
Benchmarks
Setup
AMD Ryzen 9 5900HX, rustc 1.69.0 (84c898d65 2023-04-16), Manjaro, CPU boost disabled, CPU: 3 GHz
Tests can be run in the lz4_flex repository with `cargo bench`.
In the benchmarks below, all tests are done with out-of-bounds checks enabled. All APIs are sound.
Block Format
The block format is ideal for small data up to several megabytes. The lz4_flex crate is tested in two configuration variants:
- no suffix: contains `unsafe`.
- safe suffix: contains no `unsafe`.
Frame Format
The frame format is suitable for streaming large data. That is reflected in the test data for the benchmarks.
A frame consists of concatenated blocks. Blocks in the frame format can be independent or linked for higher compression.
The tests below are from the lz4_flex variants that contain `unsafe`.
Performance Improvements
In this release I made an effort to document the changes that improve performance. Maybe you can apply the insights to your own code.
Custom memcpy
Calls to `memcpy` are replaced with a custom memcpy implementation, similar to the 16-byte wildcopy.
The function also employs a double copy trick to reduce the number of instructions the CPU has to execute.
Check it out here: https://github.com/PSeitz/lz4_flex/blob/main/src/fastcpy.rs
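The gist of the approach, as a minimal sketch (the function name here is made up; the real code lives in fastcpy.rs): copy in fixed 16-byte chunks, and instead of a variable-length tail copy, re-copy the last 16 bytes, overlapping with bytes that were already written.

```rust
/// Simplified sketch of the fastcpy idea, not the actual implementation.
/// Requires dst.len() >= src.len().
fn copy_in_16_byte_chunks(src: &[u8], dst: &mut [u8]) {
    let len = src.len();
    if len < 16 {
        // The real implementation handles short inputs with smaller
        // fixed-size copies; a plain copy keeps this sketch simple.
        dst[..len].copy_from_slice(src);
        return;
    }
    let mut i = 0;
    // Fixed-size 16-byte copies compile to wide, unconditional load/store pairs.
    while i + 16 <= len {
        dst[i..i + 16].copy_from_slice(&src[i..i + 16]);
        i += 16;
    }
    if i < len {
        // The "double copy" trick: instead of a variable-length tail copy,
        // copy the last 16 bytes, overlapping bytes that were just written.
        dst[len - 16..len].copy_from_slice(&src[len - 16..len]);
    }
}
```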
Optimize Wildcopy
The initial check in the 16-byte wildcopy is unnecessary, since it is already done before calling the method. Likely irrelevant for performance, but in general: avoid unnecessary work.
- PR: #109
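For context, a minimal sketch of what a 16-byte wildcopy does (illustrative, not the actual lz4_flex code): it copies in full 16-byte steps and may intentionally overshoot `num_bytes`, relying on guaranteed slack space in both buffers.

```rust
/// Sketch of a 16-byte wildcopy (illustrative). Both `src` and `dst` must
/// have `num_bytes` rounded up to a multiple of 16 available, so the copy
/// can overshoot without a variable-length tail.
fn wildcopy_16(src: &[u8], dst: &mut [u8], num_bytes: usize) {
    let mut i = 0;
    while i < num_bytes {
        dst[i..i + 16].copy_from_slice(&src[i..i + 16]);
        i += 16;
    }
}
```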
Faster duplicate_overlapping
Replace the aggressive compiler unrolling, after the failed attempt in #69 (which wrote out of bounds in some cases). The compiler will unroll/auto-vectorize a simple byte-by-byte copy into a version with a lot of branches. This is not what we want, as large overlapping copies are not that common. The unrolling is avoided by manually unrolling with a less aggressive version. Decompression performance is slightly improved, by ca. 4%, except for the smallest test case.
- PR: #114
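A sketch of the technique (a hypothetical helper, not the exact lz4_flex code): the copy proceeds in order, manually unrolled two bytes at a time, which keeps the loop too simple for the compiler to replace with a branch-heavy vectorized version.

```rust
/// Sketch of an in-order overlapping copy with LZ4 semantics (hypothetical
/// helper, not the exact lz4_flex code). `src` may be closer than `len`
/// bytes behind `dst`, e.g. an offset of 1 repeats a single byte.
fn duplicate_overlapping_sketch(buf: &mut [u8], mut dst: usize, mut src: usize, len: usize) {
    let end = dst + len;
    // Manually unrolled: two byte copies per iteration. The writes must stay
    // interleaved with the reads so an overlapping source sees fresh bytes.
    while dst + 1 < end {
        buf[dst] = buf[src];
        buf[dst + 1] = buf[src + 1];
        dst += 2;
        src += 2;
    }
    if dst < end {
        buf[dst] = buf[src];
    }
}
```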
Simplify extend_from_within_overlapping
`extend_from_within_overlapping` is used in safe decompression when overlapping data has been detected.
The previous version had an unnecessary assertion, which did not hoist the bounds checks. Removing the temporary `&mut` slice also simplifies the resulting assembly.
The uiCA Code Analyzer can analyze assembly to estimate the number of cycles on different architectures. If you use it, be aware that the assembly may change drastically with inlining.
| Variant | Skylake | Ice Lake | Tiger Lake | Rocket Lake |
|---|---|---|---|---|
| uiCA Cycles Prev | 28.71 | 30.67 | 28.71 | 27.57 |
| uiCA Cycles Simplified | 13.00 | 15.00 | 13.00 | 11.00 |
- PR: #72
Improve Safe Decompression Performance 8-18%
Reduce repeated slice fetches: every slice access, including nested ones, carries some overhead. In the hot loop, a fixed-size `&[u8; 16]` is fetched once to operate on; this is done purely to pass that information to the compiler. Also remove error handling that only carries overhead: as we are in safe mode, we can rely on the bounds checks, so custom error handling only adds overhead. In normal operation, no error should occur.
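In code, the fixed-size fetch looks roughly like this (illustrative names, not the exact lz4_flex internals): after the single checked conversion, the compiler knows the exact length of the chunk and can drop per-byte bounds checks.

```rust
// Illustrative sketch: one checked conversion up front, then all accesses
// through the returned &[u8; 16] are statically known to be in range.
fn read_chunk(input: &[u8], pos: usize) -> &[u8; 16] {
    input[pos..pos + 16].try_into().unwrap()
}
```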
The strategy to identify improvements was to count the lines of assembly. A rough heuristic, but it seems effective.

```bash
cargo asm --release --example decompress_block decompress_internal | wc -l
```
- PR: #73
Improve Safe Frame Compression Performance 7-15%
The frame encoding uses a fixed-size hashtable. By creating a special hashtable backed by a `Box<[u32; 4096]>`, in combination with the bit shift of 4, which is also moved into a constant, the compiler can remove the bounds checks. For that to happen, the compiler also needs to recognize that the `>> 52` right shift from the hash algorithm keeps the index in range (`u64 >> 52 < 4096`), which is the case. Yey. It also means we can use less `unsafe` in the unsafe variant.
- PR: #74
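A minimal sketch of that pattern (illustrative, not the actual lz4_flex hashtable):

```rust
/// Simplified sketch of a bounds-check-free hashtable lookup.
struct HashTable4K {
    dict: Box<[u32; 4096]>,
}

impl HashTable4K {
    #[inline]
    fn get(&self, hash: u64) -> u32 {
        // `hash >> 52` is always < 2^12 == 4096, so the compiler can prove
        // the index is in range for `[u32; 4096]` and drop the bounds check.
        self.dict[(hash >> 52) as usize]
    }
}
```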
Switch To Use Only 3 Kinds Of Hashtable
Use only hashtables with fixed sizes and bit shifts that allow removing bounds checks.
- PR: #77
Why is the safe variant slower? 😤😤😤
The safe variant’s performance has improved significantly due to ongoing effort and compiler optimizations. It is still lagging behind though, and there are only a few reasons for that:
- Bounds checks: Each slice access in the safe variant requires a bounds check, if it can’t be elided. The `unsafe` variant can avoid them by using `get_unchecked` or similar. It can be difficult to convince the compiler to eliminate bounds checks.
- slice_index_order: Do you know bounds checks’ evil twin brother, `slice_index_order`? For any slice access with an addition in the index, the compiler has to assume the value may wrap, causing an index order failure.

  ```rust
  data[pos..pos + 10] // bounds check + slice_index_order check
  ```

  `slice_index_order` checks can be avoided by proving a value can’t wrap, e.g. `pos_u32 as usize + 10`.

  UPDATE: A fix in LLVM has been merged; soon you should be able to simply use `saturating_add`.
- LLVM Assembly: The presence of bounds checks in the safe version limits the optimizations the compiler can perform without altering the logic. Code can’t be reordered if that would reorder a bounds check and therefore a potential panic. Additionally, there are cases where LLVM generates suboptimal assembly, further impacting performance. I think the reason for that is that the Rust compiler generates a lot of LLVM IR for bounds checks.
To provide an example where bounds check hoisting could not be performed: this loop, which copies from an earlier position in the buffer, does two bounds checks per iteration for every byte copied. That’s a lot of overhead.
```rust
pub fn extend_from_within_overlapping(output: &mut [u8], pos: &mut usize, start: usize, num_bytes: usize) {
    let offset = *pos - start;
    for i in start + offset..start + offset + num_bytes {
        output[i] = output[i - offset]; // Two bounds checks
    }
    *pos += num_bytes;
}
```
So if you are up for the challenge of trying to remove the bounds checks, it’s all ready for you: https://godbolt.org/z/bscqYModz
New Features
Allow passing a buffer larger than size
This removes an unnecessary check in decompression that rejected passed buffers larger than the expected size. #78.
Add auto_finish to FrameEncoder
Adds `auto_finish` to `FrameEncoder`, which flushes on drop. This is useful if the flush method is abstracted away in a `Box<dyn Writer>`. #95 #100.
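A minimal usage sketch (the writer and data here are placeholders):

```rust
use std::io::Write;
use lz4_flex::frame::FrameEncoder;

fn write_compressed(writer: impl Write) -> std::io::Result<()> {
    // auto_finish() wraps the encoder so the frame is finalized on drop,
    // no explicit finish() call needed.
    let mut enc = FrameEncoder::new(writer).auto_finish();
    enc.write_all(b"hello lz4")?;
    Ok(())
}
```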
Autodetect frame blocksize
The default blocksize of `FrameInfo` is now auto instead of 64 KB; it will detect the blocksize depending on the size of the first write call. This increases compression ratio and speed for use cases where the data is larger than 64 KB. #81
Add fluent API style construction for FrameInfo
This adds fluent API style construction for `FrameInfo` (thanks @CosmicHorrorDev). #99.
```rust
let info = FrameInfo::new()
    .block_size(BlockSize::Max1MB)
    .content_checksum(true);
```
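The resulting `FrameInfo` can then be handed to the encoder. A sketch, assuming `with_frame_info` as the `FrameEncoder` constructor that takes a `FrameInfo`:

```rust
use lz4_flex::frame::{BlockSize, FrameEncoder, FrameInfo};

fn main() {
    let info = FrameInfo::new().block_size(BlockSize::Max1MB);
    // Construct the encoder with the customized FrameInfo,
    // writing into an in-memory buffer for this example.
    let _encoder = FrameEncoder::with_frame_info(info, Vec::new());
}
```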
Handle empty input
Empty input was previously ignored and didn’t write anything. Now an empty frame is written. This improves compatibility with the reference implementation and handles some corner cases. #120.
Breaking Changes
lz4_flex `0.10` and before had a feature flag `checked-decode` to do additional checks to avoid out-of-bounds access. `0.11` inverts this feature flag to `unchecked-decode`.
Typically, feature flags in Rust are additive, so enabling a feature flag adds functionality. In the case of `checked-decode`, there was an unwanted side effect: `default-features=false` caused the API to be unsound. With `unchecked-decode`, it requires deliberate action to enable the unsound API, for cases where the caller knows it is safe to call.
Update: `0.11.1` has been released, which removes `unchecked-decode`, since the removed checks could spill into other parts of the code due to feature unification. The API is now always sound.
Changelog
See the full changelog
If you want to comment, here is the discussion on Reddit.