lz4_flex 0.11: Gainzzzzz Unleashed!
lz4_flex is an LZ4 compression library written in Rust.
lz4_flex 0.11 has been unleashed, and since the sole purpose of this blog is memes, I’ll summarize the performance gains as: gainnzzzz.
I mean, look at all the z’s in gainnzzzz. It has to be really fast.
Benchmarks
Setup
AMD Ryzen 9 5900HX, rustc 1.69.0 (84c898d65 2023-04-16), Manjaro, CPU boost disabled, CPU: 3 GHz
Tests can be run in the lz4_flex repository with `cargo bench`.
In the benchmarks below, all tests are done with out-of-bounds checks enabled. All APIs are sound.
Block Format
The block format is ideal for small data up to several megabytes. The lz4_flex crate is tested in two configuration variants:
- no suffix: contains `unsafe`.
- safe suffix: contains no `unsafe`.
Frame Format
The frame format is suitable for streaming large data. That is reflected in the test data for the benchmarks.
A frame consists of concatenated blocks. Blocks in the frame format can be independent or linked for higher compression.
The tests below are from the lz4_flex variants that contain `unsafe`.
Performance Improvements
In this release I made an effort to document the changes that improve performance. Maybe you can apply the insights to your own code.
Custom memcpy
Calls to `memcpy` are replaced with a custom memcpy implementation, similar to the 16-byte wildcopy.
The function also employs a double copy trick to reduce the number of instructions the CPU has to execute.
Check it out here: https://github.com/PSeitz/lz4_flex/blob/main/src/fastcpy.rs
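The gist of the approach, as a minimal sketch (the function name here is made up; the real code lives in fastcpy.rs): copy in fixed 16-byte chunks, and instead of a variable-length tail copy, re-copy the last 16 bytes, overlapping with bytes that were already written.

```rust
/// Simplified sketch of the fastcpy idea, not the actual implementation.
/// Requires dst.len() >= src.len().
fn copy_in_16_byte_chunks(src: &[u8], dst: &mut [u8]) {
    let len = src.len();
    if len < 16 {
        // The real implementation handles short inputs with smaller
        // fixed-size copies; a plain copy keeps this sketch simple.
        dst[..len].copy_from_slice(src);
        return;
    }
    let mut i = 0;
    // Fixed-size 16-byte copies compile to wide, unconditional load/store pairs.
    while i + 16 <= len {
        dst[i..i + 16].copy_from_slice(&src[i..i + 16]);
        i += 16;
    }
    if i < len {
        // The "double copy" trick: instead of a variable-length tail copy,
        // copy the last 16 bytes, overlapping bytes that were just written.
        dst[len - 16..len].copy_from_slice(&src[len - 16..len]);
    }
}
```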
Optimize Wildcopy
The initial check in the 16-byte wildcopy is unnecessary, since it is already done before calling the method. Likely irrelevant for performance, but in general: avoid unnecessary work.
- PR: #109
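For context, a minimal sketch of what a 16-byte wildcopy does (illustrative, not the actual lz4_flex code): it copies in full 16-byte steps and may intentionally overshoot `num_bytes`, relying on guaranteed slack space in both buffers.

```rust
/// Sketch of a 16-byte wildcopy (illustrative). Both `src` and `dst` must
/// have `num_bytes` rounded up to a multiple of 16 available, so the copy
/// can overshoot without a variable-length tail.
fn wildcopy_16(src: &[u8], dst: &mut [u8], num_bytes: usize) {
    let mut i = 0;
    while i < num_bytes {
        dst[i..i + 16].copy_from_slice(&src[i..i + 16]);
        i += 16;
    }
}
```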
Faster duplicate_overlapping
Replace the aggressive compiler unrolling, after the failed attempt in #69 (which wrote out of bounds in some cases). The compiler will unroll/auto-vectorize a simple byte-by-byte copy into a version with a lot of branches. This is not what we want, as large overlapping copies are not that common. The unrolling is avoided by manually unrolling with a less aggressive version. Decompression performance is slightly improved, by ca. 4%, except for the smallest test case.
- PR: #114
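A sketch of the technique (a hypothetical helper, not the exact lz4_flex code): the copy proceeds in order, manually unrolled two bytes at a time, which keeps the loop too simple for the compiler to replace with a branch-heavy vectorized version.

```rust
/// Sketch of an in-order overlapping copy with LZ4 semantics (hypothetical
/// helper, not the exact lz4_flex code). `src` may be closer than `len`
/// bytes behind `dst`, e.g. an offset of 1 repeats a single byte.
fn duplicate_overlapping_sketch(buf: &mut [u8], mut dst: usize, mut src: usize, len: usize) {
    let end = dst + len;
    // Manually unrolled: two byte copies per iteration. The writes must stay
    // interleaved with the reads so an overlapping source sees fresh bytes.
    while dst + 1 < end {
        buf[dst] = buf[src];
        buf[dst + 1] = buf[src + 1];
        dst += 2;
        src += 2;
    }
    if dst < end {
        buf[dst] = buf[src];
    }
}
```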
Simplify extend_from_within_overlapping
`extend_from_within_overlapping` is used in safe decompression when overlapping data has been detected.
The previous version had an unnecessary assertion, which did not hoist the bounds checks. Removing the temporary `&mut` slice also simplifies the resulting assembly.
The uiCA Code Analyzer can analyze assembly to estimate the number of cycles on different architectures. If you use it, be aware that the assembly may change drastically with inlining.
| Variant | Skylake | Ice Lake | Tiger Lake | Rocket Lake |
|---|---|---|---|---|
| uiCA Cycles Prev | 28.71 | 30.67 | 28.71 | 27.57 |
| uiCA Cycles Simplified | 13.00 | 15.00 | 13.00 | 11.00 |
- PR: #72
Improve Safe Decompression Performance 8-18%
Reduce repeated slice fetches: every slice access, including nested ones, carries some overhead. In the hot loop, a fixed-size `&[u8; 16]` is fetched once to operate on; this is done purely to pass that information to the compiler. Also remove error handling that only carries overhead: as we are in safe mode, we can rely on the bounds checks, so custom error handling only adds overhead. In normal operation, no error should occur.
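In code, the fixed-size fetch looks roughly like this (illustrative names, not the exact lz4_flex internals): after the single checked conversion, the compiler knows the exact length of the chunk and can drop per-byte bounds checks.

```rust
// Illustrative sketch: one checked conversion up front, then all accesses
// through the returned &[u8; 16] are statically known to be in range.
fn read_chunk(input: &[u8], pos: usize) -> &[u8; 16] {
    input[pos..pos + 16].try_into().unwrap()
}
```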
The strategy to identify improvements was to count the lines of assembly. A rough heuristic, but it seems effective.

```bash
cargo asm --release --example decompress_block decompress_internal | wc -l
```
- PR: #73
Improve Safe Frame Compression Performance 7-15%
The frame encoding uses a fixed-size hashtable. By creating a special hashtable backed by a `Box<[u32; 4096]>`, in combination with the bit shift of 4, which is also moved into a constant, the compiler can remove the bounds checks. For that to happen, the compiler also needs to recognize that the `>> 52` right shift from the hash algorithm keeps the index in range (`u64 >> 52 < 4096`), which is the case. Yey. It also means we can use less `unsafe` in the unsafe variant.
- PR: #74
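A minimal sketch of that pattern (illustrative, not the actual lz4_flex hashtable):

```rust
/// Simplified sketch of a bounds-check-free hashtable lookup.
struct HashTable4K {
    dict: Box<[u32; 4096]>,
}

impl HashTable4K {
    #[inline]
    fn get(&self, hash: u64) -> u32 {
        // `hash >> 52` is always < 2^12 == 4096, so the compiler can prove
        // the index is in range for `[u32; 4096]` and drop the bounds check.
        self.dict[(hash >> 52) as usize]
    }
}
```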
Switch To Use Only 3 Kinds Of Hashtable
Use only hashtables with fixed sizes and bit shifts that allow removing bounds checks.
- PR: #77
Why is the safe variant slower? 😤😤😤
The safe variant’s performance has improved significantly due to ongoing effort and compiler optimizations. It is still lagging behind though, and there are only a few reasons for that:
- Bounds checks: Each slice access in the safe variant requires a bounds check, if it can’t be elided. The `unsafe` variant can avoid them by using `get_unchecked` or similar. It can be difficult to convince the compiler to eliminate bounds checks.
- slice_index_order: Do you know bounds checks’ evil twin brother, `slice_index_order`? For any slice access with an addition in the index, the compiler has to assume the value may wrap, causing an index order failure.

  ```rust
  data[pos..pos + 10] // bounds check + slice_index_order check
  ```

  `slice_index_order` checks can be avoided by proving a value can’t wrap, e.g. `pos_u32 as usize + 10`.

  UPDATE: A fix in LLVM has been merged; soon you should be able to simply use `saturating_add`.
- LLVM Assembly: The presence of bounds checks in the safe version limits the optimizations the compiler can perform without altering the logic. Code can’t be reordered if that would reorder a bounds check and therefore a potential panic. Additionally, there are cases where LLVM generates suboptimal assembly, further impacting performance. I think the reason for that is that the Rust compiler generates a lot of LLVM IR for bounds checks.
To provide an example where bounds check hoisting could not be performed: this loop, which copies from an earlier position in the buffer, does two bounds checks per iteration for every byte copied. That’s a lot of overhead.
```rust
pub fn extend_from_within_overlapping(output: &mut [u8], pos: &mut usize, start: usize, num_bytes: usize) {
    let offset = *pos - start;
    for i in start + offset..start + offset + num_bytes {
        output[i] = output[i - offset]; // Two bounds checks
    }
    *pos += num_bytes;
}
```
So if you are up for the challenge of trying to remove the bounds checks, it’s all ready for you: https://godbolt.org/z/bscqYModz
New Features
Allow passing a buffer larger than size
This removes an unnecessary check in decompression that rejected passed buffers larger than the expected size. #78.
Add auto_finish to FrameEncoder
Adds `auto_finish` to `FrameEncoder`, which flushes on drop. This is useful if the flush method is abstracted away in a `Box<dyn Writer>`. #95 #100.
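A minimal usage sketch (the writer and data here are placeholders):

```rust
use std::io::Write;
use lz4_flex::frame::FrameEncoder;

fn write_compressed(writer: impl Write) -> std::io::Result<()> {
    // auto_finish() wraps the encoder so the frame is finalized on drop,
    // no explicit finish() call needed.
    let mut enc = FrameEncoder::new(writer).auto_finish();
    enc.write_all(b"hello lz4")?;
    Ok(())
}
```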
Autodetect frame blocksize
The default blocksize of `FrameInfo` is now auto instead of 64 KB; it will detect the blocksize depending on the size of the first write call. This increases compression ratio and speed for use cases where the data is larger than 64 KB. #81
Add fluent API style construction for FrameInfo
This adds fluent API style construction for `FrameInfo` (thanks @CosmicHorrorDev). #99.
```rust
let info = FrameInfo::new()
    .block_size(BlockSize::Max1MB)
    .content_checksum(true);
```
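The resulting `FrameInfo` can then be handed to the encoder. A sketch, assuming `with_frame_info` as the `FrameEncoder` constructor that takes a `FrameInfo`:

```rust
use lz4_flex::frame::{BlockSize, FrameEncoder, FrameInfo};

fn main() {
    let info = FrameInfo::new().block_size(BlockSize::Max1MB);
    // Construct the encoder with the customized FrameInfo,
    // writing into an in-memory buffer for this example.
    let _encoder = FrameEncoder::with_frame_info(info, Vec::new());
}
```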
Handle empty input
Empty input was previously ignored and didn’t write anything. Now an empty frame is written. This improves compatibility with the reference implementation and handles some corner cases. #120.
Breaking Changes
lz4_flex `0.10` and before had a feature flag `checked-decode` to do additional checks to avoid out-of-bounds access. `0.11` inverts this feature flag to `unchecked-decode`.
Typically, feature flags in Rust are additive, so enabling a feature flag adds functionality. In the case of `checked-decode`, there was an unwanted side effect: `default-features=false` caused the API to be unsound. With `unchecked-decode`, it requires deliberate action to enable the unsound API, for cases where the caller knows it is safe to call.
Update: `0.11.1` has been released, which removes `unchecked-decode`, since the removed checks could spill into other parts of the code due to feature unification. The API is now always sound.
Changelog
See the full changelog
If you want to comment, here is the discussion on Reddit.