What are the AVX-512 Galois-field-related instructions for?

One of the AVX-512 instruction set extensions is AVX-512 + GFNI, "Galois Field New Instructions".

Galois theory is about field extensions. What does that have to do with processing vectorized integer or floating-point values? The instructions supposedly perform "Galois field affine transformation", the inverse of that, and "Galois field multiply bytes".

What fields are those? What do these instructions actually do and what is it good for?

Amenity answered 1/12, 2019 at 10:39 Comment(0)

These instructions are closely related to the AES (Rijndael) block cipher. GF2P8AFFINEINVQB performs a Rijndael S-Box substitution with a user-defined affine transformation.

GF2P8AFFINEQB is essentially a (carry-less) multiplication of an 8x8 bit matrix with an 8-bit vector over GF(2), followed by XOR with a constant byte taken from the immediate, so it should be useful in other bit-oriented algorithms. It can also be used to convert between isomorphic representations of GF(2^8).
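A scalar model of what it does to each byte may make that concrete (my reading of the SDM pseudocode; in particular I believe the row for result bit i sits in byte 7-i of the matrix qword, which is worth double-checking against the manual):

```c
#include <stdint.h>

/* Sketch of one byte of GF2P8AFFINEQB: y = A*x XOR b over GF(2).
 * 'matrix' is one 64-bit lane of the matrix operand (8 rows of 8 bits),
 * 'x' is one data byte, 'imm8' is the constant byte of the affine transform.
 * Assumption: the row for result bit i is stored in byte 7-i of the qword. */
static uint8_t gf2p8affine_byte(uint64_t matrix, uint8_t x, uint8_t imm8)
{
    uint8_t y = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t row = (uint8_t)(matrix >> (8 * (7 - i)));
        uint8_t dot = row & x;                              /* AND = GF(2) multiply */
        dot ^= dot >> 4; dot ^= dot >> 2; dot ^= dot >> 1;  /* parity = GF(2) sum */
        y |= (uint8_t)(((dot ^ (imm8 >> i)) & 1) << i);
    }
    return y;
}
```

With `matrix = 0x0102040810204080` (the identity under this layout) and `imm8 = 0` it returns `x` unchanged.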

GF2P8MULB multiplies two (vectors of) elements of GF(2^8), actually 8-bit numbers in polynomial representation with the Rijndael reduction polynomial. This operation is used in Rijndael's MixColumns step.

Note that multiplication in finite fields is only loosely related to integer multiplication.
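For example, the multiplications by {02} and {03} in MixColumns can be done 16 bytes at a time with the corresponding intrinsic (a sketch, assuming a GFNI-capable CPU and a flag like -mgfni; the helper names are mine, and the reduction polynomial is the hard-wired Rijndael one):

```c
#include <immintrin.h>   /* compile with e.g. -mgfni -msse2 */

/* Byte-wise GF(2^8) multiply by the MixColumns constants {02} and {03},
 * using the polynomial x^8 + x^4 + x^3 + x + 1 baked into the hardware. */
static inline __m128i xtime_bytes(__m128i state) {
    return _mm_gf2p8mul_epi8(state, _mm_set1_epi8(0x02));
}
static inline __m128i mul3_bytes(__m128i state) {
    return _mm_gf2p8mul_epi8(state, _mm_set1_epi8(0x03));
}
```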

Athodyd answered 1/12, 2019 at 11:58 Comment(4)
Interesting, but why would you ever use these instead of AES-NI? aesenc performs one round of AES encryption on a 128-bit XMM vector. I think if they wanted to accelerate multiple AES streams interleaved in one ZMM vector, Intel would have widened the AES instructions.Semasiology
@PeterCordes I don't think the new instructions are meant to replace AES-NI. My best guess is that Intel is simply adding instructions for some AES suboperations because it's cheap to reuse silicon of the already existing AES implementation.Athodyd
That would make sense; I hadn't realized GFNI only supported 128-bit vectors, same as AES-NI; I had been thinking it was 512-bit wide ZMM vectors! And well spotted that gf2p8mulb's baked-in polynomial of x^8 + x^4 + x^3 + x + 1 matches Rijndael's; that lends a lot of weight to your hypothesis, especially if it isn't used in any other common application (like a widespread RAID6 implementation)Semasiology
Correction, GFNI is not limited to 128-bit. The HTML extract of Intel's manual I was looking at didn't include the GFNI+AVX 256-bit form, or the GFNI+AVX512 512-bit form. But there's also a VAES extension that widens AES-NI. (That might exist partly because they were going to widen the GF hardware for AVX512 + GFNI. IIRC you need to interleave multiple data streams to use it, not speed up a single AES stream?)Semasiology

One of the major use-cases is I think SW RAID6 parity, including generating new parity on every write, not just during recovery / rebuild. RAID5 can use simple XOR parity for its one and only parity member of each stripe, but RAID6 needs two different parities so that any N of the N+2 blocks of data+parity can recover the original N blocks of data. This is forward error correction, a similar kind of problem to the one ECC solves.

Galois Fields are useful for this; they're the basis of widely-used Reed-Solomon codes, for example. PAR2 uses 16-bit Galois Fields to allow very large block counts (up to 64k blocks), generating relatively fine-grained error-recovery data for a large file or set of files.

Unfortunately GFNI is not great for PAR2 because GFNI only supports GF(2^8) (hence the GF2P8 in the mnemonics), not the GF(2^16) that PAR2 uses. http://lab.jerasure.org/jerasure/gf-complete/issues/14 says it's possible to use GF2P8AFFINEQB to implement wider word sizes, so it might be possible to speed up PAR2 with it.

But it should be useful for RAID6, including generating new parity on writes, which is pretty CPU-intensive. The Linux kernel's md driver already includes inline asm to use SSE2 or AVX2, one of the few uses of kernel_fpu_begin() and kernel_fpu_end(). (A 2013 paper looks at optimizing GF coding using Intel SIMD, mentioning Linux's md RAID and GF-Complete, the project linked earlier. The current state of the art is something like two pshufb byte shuffles to do a table lookup one 4-bit nibble at a time; GFNI could bring that down to 1 instruction, especially if the hard-coded GF polynomial baked into gf2p8mulb is usable.)
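As a very rough sketch of that inner loop with GFNI (hand-waving an important detail: Linux's RAID6 math uses the 0x11D polynomial while gf2p8mulb hard-codes Rijndael's 0x11B, so a real implementation would have to convert between the two representations, e.g. with the affine instruction, or redefine its generator; the function and buffer names here are made up):

```c
#include <immintrin.h>   /* compile with e.g. -mgfni -mavx512f */
#include <stdint.h>
#include <stddef.h>

/* dst[i] ^= coef * src[i] over GF(2^8), 64 bytes per iteration.
 * 'len' is assumed to be a multiple of 64 for this sketch.
 * Caveat from the text above: the hardware's reduction polynomial is
 * 0x11B, not the 0x11D field the Linux RAID6 code uses. */
static void gf_mul_acc(uint8_t *dst, const uint8_t *src, uint8_t coef, size_t len)
{
    __m512i c = _mm512_set1_epi8((char)coef);
    for (size_t i = 0; i < len; i += 64) {
        __m512i s = _mm512_loadu_si512(src + i);
        __m512i d = _mm512_loadu_si512(dst + i);
        d = _mm512_xor_si512(d, _mm512_gf2p8mul_epi8(s, c));
        _mm512_storeu_si512(dst + i, d);
    }
}
```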

(RAID6 uses parity in a different way than par2, generating separate parity for each stripe "vertically" across disks, instead of "horizontally" for one big array of data. The underlying math is similar.)

Intel quite probably plans to support GFNI on some future Silvermont-family Atom, because there are legacy-SSE encodings of the instructions, without 3-operand VEX or EVEX. Many other new instructions are introduced with only VEX encodings, including some of the BMI1/BMI2 scalar integer instructions.

Silvermont-family (Airmont, Goldmont, Tremont, ...) gets some use in NAS appliances where most of the CPU demand could come from RAID6. A future version of it with GFNI could save power, or avoid bottlenecks without raising clock speed.

AVX + GFNI implies support for a YMM version (even without AVX2), and AVX512F + GFNI implies a ZMM version. (The HTML extract at felixcloutier.com strangely only mentions the non-VEX 128-bit encoding while also listing a _mm_maskz_gf2p8affine_epi64_epi8 intrinsic (masking requires EVEX). HJLebbink's HTML extract does include the VEX and EVEX forms. Maybe they only appear in Intel's "future extensions" manual which HJ scrapes but Felix doesn't.)

Using 512-bit vectors does limit turbo clock speeds for a short time afterwards (on Skylake-Xeon), so it might not be desirable for the kernel to do that. But it could give a significant reduction in CPU overhead for some cases, if you're not memory-bound.


A "field" is a mathematical concept:

(wikipedia) In mathematics, a field is a set on which addition, subtraction, multiplication, and division are defined and behave as the corresponding operations on rational and real numbers do.

...

including the existence of an additive inverse −a for all elements a, and of a multiplicative inverse b^−1 for every nonzero element b

Galois Fields are a kind of Finite Field with this property: the bits in a GF(2^8) number represent the 0 or 1 coefficients of a polynomial of degree less than 8, rather than place-value binary digits. (It's quite possible I totally butchered that, but it's something like that.) That's why carryless addition (aka XOR) and carryless multiplication (using shift/XOR instead of shift/add) are the right operations over Galois fields.
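A scalar reference version makes the shift/XOR point concrete (a sketch using the Rijndael polynomial 0x11B; note that in this arithmetic 0x03 * 0x03 = 0x05, not 9, because the cross terms combine with XOR instead of carrying):

```c
#include <stdint.h>

/* GF(2^8) multiply: shift and XOR instead of shift and add, reducing by
 * x^8 + x^4 + x^3 + x + 1 (0x11B) whenever the running value overflows 8 bits. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;            /* "add" this partial product (carryless) */
        uint8_t hi = a & 0x80;
        a <<= 1;
        if (hi)
            a ^= 0x1B;         /* "subtract" (XOR) the reduction polynomial */
        b >>= 1;
    }
    return p;                  /* gf256_mul(0x03, 0x03) == 0x05, vs. 3*3 == 9 */
}
```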

gf2p8mulb's baked-in polynomial of x^8 + x^4 + x^3 + x + 1 matches the one used in AES (Rijndael); this lends more weight to @nwellnhof's hypothesis that Intel just included it because the HW was there.

If it's also used in any other common application, it might give us a clue of the "intended" use case for these instructions.

There is a VAES extension that provides versions of AESENC and related instructions for YMM and ZMM vectors, up from just 128-bit vectors with AES-NI + AVX2. So Intel apparently is extending AES HW to 512-bit SIMD vectors. IDK if this motivates wide GFNI or vice versa, or some of both. (Wide GFNI makes a huge amount of sense; if it was limited to 128-bit, an optimized AVX512 implementation using vpshufb for lookup tables would beat it.)

Semasiology answered 1/12, 2019 at 11:44 Comment(3)
Oh, those fields. Ok. I know about polynomial extension fields, just not that they're named after Galois.Amenity
Note that the GFNI instructions only work with finite fields of order 2^8, so I think they're primarily geared toward cryptography. You typically use larger fields for error correction. For example, PAR2 uses GF(2^16), not GF(16).Athodyd
@nwellnhof: yes, I'm aware they're not actually useful for PAR2, unfortunately. But I forgot to make that clear in my answer, thanks! GF-Complete suggests GF2P8AFFINEQB might be usable for 16-bit word sizes using the right tricks. Also thanks for the GF(16) vs. GF(2^16) correction; I think par2 source code uses gf16 or GaloisField<16> as a type name so I got used to thinking of it as GF16. Fixed my answer.Semasiology

To answer the purpose part, my guess is that these were added primarily for accelerating SM4 encryption, which is similar to AES in design.
This guess comes from the fact that ARM also added SM4 acceleration in ARMv8.4 at around the same time, suggesting that chipmakers wanted to accelerate this algorithm, probably because it was expected to gain significant traction in the Chinese market. The fact that it's the only AVX-512 extension added in Icelake which also has an SSE encoding, so that Tremont could support it, also suggests that they intended it for networking/storage purposes.

GFNI is also quite useful in Reed-Solomon coding for error correction (as mentioned by Peter above). It's directly applicable to any GF(2^8) implementation (such as this), and the affine instruction can be used for other field sizes and polynomials - in fact, it's the fastest technique I know of to do so on an Intel processor.
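One way to see why the affine instruction covers other polynomials: multiplying by a fixed constant c in any GF(2^8) representation is a linear map over GF(2), so it can be baked into an 8x8 bit matrix once and then applied to 16/32/64 bytes per instruction. A sketch of building that matrix (helper names are made up, and the layout assumes the row for result bit i lives in byte 7-i of the qword, the same convention noted earlier):

```c
#include <stdint.h>
#include <immintrin.h>

/* Build the 8x8 bit matrix (in GF2P8AFFINEQB's qword layout) that multiplies
 * a byte by the constant 'c' in GF(2^8) with reduction polynomial 'poly',
 * e.g. poly = 0x11D for the common RAID6 / Reed-Solomon field. */
static uint64_t gf_mul_matrix(uint8_t c, unsigned poly)
{
    uint64_t m = 0;
    for (int j = 0; j < 8; j++) {
        unsigned col = c;                  /* column j of the map is c * x^j mod poly */
        for (int k = 0; k < j; k++) {
            col <<= 1;
            if (col & 0x100) col ^= poly;
        }
        for (int i = 0; i < 8; i++)
            if (col & (1u << i))
                m |= 1ull << (8 * (7 - i) + j);   /* row i is byte 7-i of the qword */
    }
    return m;
}

/* Usage sketch: multiply 16 bytes by 'c' in that field (imm8 = 0, i.e. no
 * added constant). Needs GFNI; data first, then matrix, if I'm reading the
 * intrinsics guide right. */
static inline __m128i gf_mul_const(__m128i data, uint8_t c, unsigned poly)
{
    __m128i A = _mm_set1_epi64x((long long)gf_mul_matrix(c, poly));
    return _mm_gf2p8affine_epi64_epi8(data, A, 0);
}
```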

The affine instruction also has a bunch of use cases outside of Galois-field math, including 8-bit shifts and bit-permutes; a couple of examples are sketched below. It's equivalent to RISC-V's proposed bmatxor instruction, where some use cases are listed here.
Some links describing use cases for this instruction.
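For instance, reversing the bits within every byte, or shifting every byte by the same amount (there is no psllb in SSE/AVX), is just a fixed matrix with a zero constant. A sketch, assuming the usual matrix layout; the two magic constants are the commonly quoted ones for this trick:

```c
#include <immintrin.h>   /* compile with e.g. -mgfni -mavx */

/* Reverse the bit order within each byte of v: the matrix is the mirrored identity. */
static inline __m128i bitrev_bytes(__m128i v) {
    return _mm_gf2p8affine_epi64_epi8(v, _mm_set1_epi64x((long long)0x8040201008040201ULL), 0);
}

/* Shift every byte left by 3: the identity matrix 0x0102040810204080 shifted
 * down by 3 bytes. (Shifting it the other way gives per-byte right shifts.) */
static inline __m128i slli_bytes_by_3(__m128i v) {
    return _mm_gf2p8affine_epi64_epi8(v, _mm_set1_epi64x(0x0102040810204080LL >> 24), 0);
}
```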

Eure answered 7/6, 2020 at 11:30 Comment(0)

VGF2P8AFFINEQB is an underrated instruction that can emulate a weird form of a Programmable Gate Array. The first operand is the result. For each result bit, picture 8 AND gates: one input of each comes from a bit of a data byte in the second operand, the other from the corresponding bit of one row of an 8x8 bit matrix in the third operand. The 8 AND outputs feed a 9-input XOR gate whose ninth input is the corresponding bit of the 8-bit immediate fourth operand. Each output bit is therefore a function of 17 inputs: 8 data bits, 8 matrix bits, and one immediate bit.

  • r0=(a0&b56)⊕(a1&b57)⊕(a2&b58)⊕.....⊕(a7&b63)⊕c0
  • r1=(a0&b48)⊕(a1&b49)⊕(a2&b50)⊕.....⊕(a7&b55)⊕c1
  • ......
  • r7=(a0&b0)⊕(a1&b1)⊕(a2&b2)⊕.....⊕(a7&b7)⊕c7
  • r8=(a8&b56)⊕(a9&b57)⊕(a10&b58)⊕.....⊕(a15&b63)⊕c0
  • r9=(a8&b48)⊕(a9&b49)⊕(a10&b50)⊕.....⊕(a15&b55)⊕c1
  • ......
  • r15=(a8&b0)⊕(a9&b1)⊕(a10&b2)⊕.....⊕(a15&b7)⊕c7
  • ......
  • r63=(a56&b0)⊕(a57&b1)⊕(a58&b2)⊕.....⊕(a63&b7)⊕c7
  • r64=(a64&b120)⊕(a65&b121)⊕(a66&b122)⊕.....⊕(a71&b127)⊕c0 (each 64-bit lane of the data reuses the matrix from the same 64-bit lane of the third operand)
  • ......
  • r71=(a64&b64)⊕(a65&b65)⊕(a66&b66)⊕.....⊕(a71&b71)⊕c7
  • ......
  • r504=(a504&b504)⊕(a505&b505)⊕(a506&b506)⊕.....⊕(a511&b511)⊕c0
  • r505=(a504&b496)⊕(a505&b497)⊕(a506&b498)⊕.....⊕(a511&b503)⊕c1
  • ......
  • r511=(a504&b448)⊕(a505&b449)⊕(a506&b450)⊕.....⊕(a511&b455)⊕c7

I've been playing with semi-random matrices in the third operand and 0FFh as the immediate in order to emulate simple circuits.

Haircut answered 11/10, 2023 at 17:36 Comment(0)
