Do modern CPUs have compression instructions?

Asked 4/5, 2018 at 22:19 Answered 28/6 at 23:58

I have been curious about this for a while, since compression is used in just about everything.

Are there any basic compression support instructions in the silicon on a typical modern CPU chip?
If not, why are they not included?
Why is this different from encryption, where some CPUs have hardware support for algorithms such as AES?

Kimkimball answered 4/5, 2018 at 22:19 Comment(5)

AES does encryption, not compression: en.wikipedia.org/wiki/Advanced_Encryption_Standard – Ranson 4/5, 2018 at 23:6

Are you asking about compression/decompression or encryption/decryption? – Backandforth 5/5, 2018 at 0:26

@Stephen C What I was referring to is, AES has processor instructions, why not compression instructions to make LZ4 faster (or any other "standard" data compression algorithm) – Kimkimball 5/5, 2018 at 19:53

I have edited your question to correct the ambiguity. – Ranson 6/5, 2018 at 0:53

Which of the hundreds of lossless and/or lossy compression schemes should be supported? – Warchaw 6/5, 2018 at 14:42

They don’t have general-purpose compression instructions.

AES operates on very small data blocks: it accepts two 128-bit inputs, performs some non-trivial computation on them, and produces a single 128-bit output. A dedicated instruction to speed up that computation helps a lot.

On modern hardware, lossless compression speed is often limited by RAM latency. A dedicated instruction can’t improve that; bigger and faster caches can, but modern CPUs already have very sophisticated multi-level caches, and they work well enough for compression already.
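To make the latency point a bit more concrete, here is a minimal sketch of the kind of inner loop an LZ77-style compressor runs. It is only an illustration under my own assumptions: the names (`hash4`, `lz_matched_bytes`), the 12-bit hash table, and the greedy parse are hypothetical and not taken from any particular compressor. The thing to notice is the chain of dependent loads: the table lookup lands on an effectively random slot, and the value it returns is the address of the next load into the history window.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_BITS 12          /* hypothetical table size: 4096 entries */
#define MIN_MATCH 4           /* shortest back-reference worth taking */
#define NO_POS    UINT32_MAX  /* marks an empty table slot */

/* Hash the next 4 bytes into a table index (multiplicative hash). */
static uint32_t hash4(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return (v * 2654435761u) >> (32 - HASH_BITS);
}

/* Greedy LZ77-style parse. Returns how many input bytes were covered
 * by back-references to earlier data (nothing is actually encoded). */
static size_t lz_matched_bytes(const uint8_t *buf, size_t len)
{
    uint32_t last_pos[1u << HASH_BITS];
    size_t matched = 0, pos = 0;

    assert(len < NO_POS);
    for (size_t i = 0; i < (1u << HASH_BITS); i++)
        last_pos[i] = NO_POS;

    while (pos + MIN_MATCH <= len) {
        uint32_t h = hash4(buf + pos);
        uint32_t cand = last_pos[h];   /* load 1: effectively random slot */
        last_pos[h] = (uint32_t)pos;

        size_t n = 0;
        if (cand != NO_POS) {
            /* load 2: its address depends on the value loaded above */
            while (pos + n < len && buf[cand + n] == buf[pos + n])
                n++;
        }
        if (n >= MIN_MATCH) {
            matched += n;
            pos += n;
        } else {
            pos++;
        }
    }
    return matched;
}
```

With a window and table larger than the caches, each iteration can stall on those two loads no matter how cheap the arithmetic around them is, which is why a faster "compress" instruction alone would not buy much.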

If you need to compress many gigabits per second, there are several standalone accelerators, but these are not part of the processor; they are usually standalone chips connected over PCIe. They are also very niche products, because most users just don't need to compress that much data that fast.

However, modern CPUs have a lot of support for lossy multimedia compression. Most of them have multiple vector instruction set extensions (MMX, SSE, AVX), and some of these instructions help a lot for use cases such as video compression. For example, _mm_sad_pu8 (SSE), _mm_sad_epu8 (SSE2), and _mm256_sad_epu8 (AVX2) are very helpful for estimating the compression error of 8x8 blocks of 8-bit pixels. The AVX2 version processes 4 rows of the block in just a few cycles (5 cycles on Haswell, 1 on Skylake, 2 on Ryzen).
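As a rough illustration of how the SAD intrinsics are used, here is a sketch that computes the sum of absolute differences of two 8x8 blocks of 8-bit pixels. The function names and the `stride` parameter are my own choices for the example. It uses the SSE2 variant (`_mm_sad_epu8`) when the compiler defines `__SSE2__` and falls back to plain C otherwise, so treat it as a sketch rather than tuned encoder code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics, including _mm_sad_epu8 */
#endif

/* Portable reference: sum of absolute differences over an 8x8 block.
 * stride is the distance in bytes between consecutive rows. */
static uint32_t sad8x8_scalar(const uint8_t *a, const uint8_t *b, size_t stride)
{
    uint32_t sum = 0;
    assert(stride >= 8);
    for (size_t y = 0; y < 8; y++)
        for (size_t x = 0; x < 8; x++)
            sum += (uint32_t)abs((int)a[y * stride + x] - (int)b[y * stride + x]);
    return sum;
}

/* Same result, two rows (16 pixels) per _mm_sad_epu8 when SSE2 is available. */
static uint32_t sad8x8(const uint8_t *a, const uint8_t *b, size_t stride)
{
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    assert(stride >= 8);
    for (size_t y = 0; y < 8; y += 2) {
        /* Load two 8-byte rows from each block and pack them into one register. */
        __m128i a0 = _mm_loadl_epi64((const __m128i *)(a + y * stride));
        __m128i a1 = _mm_loadl_epi64((const __m128i *)(a + (y + 1) * stride));
        __m128i b0 = _mm_loadl_epi64((const __m128i *)(b + y * stride));
        __m128i b1 = _mm_loadl_epi64((const __m128i *)(b + (y + 1) * stride));
        __m128i ra = _mm_unpacklo_epi64(a0, a1);
        __m128i rb = _mm_unpacklo_epi64(b0, b1);
        /* One instruction yields a partial sum in each 64-bit half. */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(ra, rb));
    }
    /* Add the low and high halves to get the block total. */
    return (uint32_t)_mm_cvtsi128_si32(acc) +
           (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#else
    return sad8x8_scalar(a, b, stride);
#endif
}
```

A motion search would call something like `sad8x8` for many candidate positions and keep the one with the smallest result; the AVX2 variant follows the same pattern with four rows per instruction.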

Finally, many CPUs have integrated GPUs that include specialized silicon for hardware video encoding and decoding, usually H.264, and on newer ones also H.265. Here's a table for Intel GPUs; AMD has separate names for the encoding and decoding parts. That silicon is even more power-efficient than the SIMD instructions in the cores.

Wasting answered 4/5, 2018 at 22:54 Comment(9)

Compression accelerators are included in some processor chips (e.g., Cavium's OCTEON). Intel's QuickAssist Technology includes compression accelerators. Energy efficiency is one motive for such accelerators. – Sentience 4/5, 2018 at 23:19

@PaulA.Clayton Not included, they are standalone chips. But I’ve updated my answer adding your links. – Wasting 4/5, 2018 at 23:37

"Compression speed often limited by RAM latency. Dedicated instruction can’t improve speed, bigger and faster caches can, but modern CPUs already have very sophisticated multi-level caches. They work good enough for compression already." Are you saying that all researchers who have studied implementing compression algorithms in hardware for many years are insane? – Backandforth 5/5, 2018 at 0:30

"They don’t." Is that for compression or AES or both? x86 does include instructions for AES. – Backandforth 5/5, 2018 at 0:31

@HadiBrais "Are you saying that all researchers who have studied implementing compression algorithms in hardware for many years are insane?" No, I'm not. – Wasting 5/5, 2018 at 0:50

@PaulA.Clayton and Soonts: Intel QuickSync is on-chip hardware video compression, into h.264 / h.265 / JPEG / VP8/9 / VC-1. It's a bit of a special case, though, because it's built-in to the GPU which just happens to be on-die, not the IA cores. And it's only lossy video compression. (Unless it has a lossless mode for any of the formats? Probably not). – Dahlgren 5/5, 2018 at 13:15

@PeterCordes Right, I know about QuickSync. But because the OP wrote “compression is used in about everything”, I concluded that they are not interested in the multimedia special case. For multimedia compression CPUs offer a lot, albeit not directly. That was a selling point for various vector extensions, e.g. long ago MMX meant “MultiMedia eXtension”. – Wasting 5/5, 2018 at 13:39

I thought it was at least worth mentioning that there is fixed-function compression hardware in modern x86 CPUs. But yeah, good point that audio/video (de)compression is still one of the major use-cases for SIMD inside the CPU cores. psadbw is obviously designed for doing two 8x8 motion-searches in parallel (in the SSE2 XMM version), in video compression. (And has other uses, of course, like horizontal sum of bytes). SSE4.1 even added mpsadbw, but it's not fast enough to be worth using for exhaustive motion search. – Dahlgren 5/5, 2018 at 13:45

Intel's new Sapphire Rapids chip will include an integrated compression accelerator (although it's still accessed over PCI, it's not an ISA extension) – Gettysburg 4/9, 2022 at 18:26

Many applications in all kinds of domains can benefit from, and do use, data compression algorithms. So it would be nice to have hardware support for compression and/or decompression, similar to the hardware support for other popular functions such as encryption/decryption, various mathematical transformations, bit counting, and others. However, compression and decompression typically operate on large amounts of data (many MBs or more), and different algorithms exhibit different memory access patterns that are potentially either unfriendly to traditional memory hierarchies or even adversely impacted by them. In addition, because so much data is involved, a compressor implemented directly in the main CPU pipeline would keep the CPU almost fully busy for long periods of time. Encryption is different: encrypting small amounts of data is typical, so it makes sense to have hardware support for it directly in the CPU.

It is precisely for these reasons that hardware compression/decompression engines (accelerators) have been implemented, either as ASICs or on FPGAs, by many companies as coprocessors (on-die, on-package, or external) or expansion cards (connected through PCIe/NVMe), including:

Intel QuickAssist adapters.
Microsoft Xpress.
IBM PCIe data compression/decompression card.
Cisco hardware compression adapters.
AHA378.
Many academic proposals.

That said, it is possible to achieve very high throughput on a single modern x86 core. Intel published a paper in 2010 discussing the results of an implementation of the DEFLATE decompression algorithm called igunzip. They used a single Nehalem-based physical core and experimented with one and with two logical cores, achieving impressive decompression throughputs of more than 2 Gbits/s. The key x86 instruction is PCLMULQDQ. However, modern hardware accelerators (such as QuickAssist) can perform about 10 times faster.
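For context on where a carry-less multiply fits in: as far as I know, PCLMULQDQ mainly speeds up the CRC-32 checksum that every gzip stream carries, rather than the Huffman/LZ77 decoding itself. That is my assumption here, not something stated above. The sketch below is the plain bit-at-a-time version of that checksum (the function name is hypothetical); a PCLMULQDQ implementation computes the same value by folding many bytes per instruction instead of one bit per step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32 with the reflected polynomial 0xEDB88320,
 * the checksum used by gzip and zlib. Slow, but easy to check. */
static uint32_t crc32_bitwise(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    assert(p != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* If the low bit is set, shift and XOR in the polynomial;
             * the mask is all ones or all zeros accordingly. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for this CRC is 0xCBF43926 for the ASCII string "123456789", which is a convenient way to validate any faster table-driven or PCLMULQDQ-based replacement.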

Intel has a number of related patents, although it's hard to determine which Intel products employ the techniques or designs proposed in them.

Backandforth answered 5/5, 2018 at 19:45 Comment(2)

Password hashing isn't a great example, I don't think; it's too much of a one-off. En/Decrypting multiple small packets with the same key is where AES instructions built-in to a CPU core really shine, vs. a separate accelerator outside the core, or especially off-chip where it can't read from L1d. – Dahlgren 5/5, 2018 at 22:51

@PeterCordes Thanks Peter, I agree. Also I changed >1KB, because I think it's typical to en/decrypt at that scale (e.g., web browsers). I'm guessing here. – Backandforth 5/5, 2018 at 23:2

IBM added hardware accelerators and instructions for 842 compression (a variation on Lempel-Ziv with limited dictionary length) to their Power processors from POWER7+ (2012) onward - https://ieeexplore.ieee.org/document/6665020 , https://www.ibm.com/support/pages/system/files/inline-files/DB2_POWER_NX842_Compression_V1.1.pdf

In addition, POWER9 (2016) and Power10 added hardware acceleration for the RFC 1951 Deflate algorithm, which is based on LZ77 and is used by zlib and gzip - https://community.ibm.com/community/user/power/blogs/brian-veale1/2022/03/14/gzip-acceleration-with-aix-on-power-systems

Hoyle answered 10/2 at 5:45 Comment(0)

IBM z15 and later processor chips have zlib/gzip accelerators. See the zlib software support and the WITH_DFLTCC* build arguments.

Stodge answered 28/6 at 23:58 Comment(0)
