Starting with Cascade Lake, Intel CPUs have AVX-512 VNNI instructions which can accelerate inference of quantized neural networks on the CPU.
In particular there is an instruction _mm512_dpbusd_epi32 (vpdpbusd) which multiplies 8-bit unsigned and signed integers and accumulates the products into 32-bit integer accumulators.
Pseudocode for this instruction is shown below:
void _mm512_dpbusd_epi32(int32_t sum[16], uint8_t a[16][4], int8_t b[16][4])
{
    for (int i = 0; i < 16; ++i)
        sum[i] +=
            (int)a[i][0]*b[i][0] + (int)a[i][1]*b[i][1] +
            (int)a[i][2]*b[i][2] + (int)a[i][3]*b[i][3];
}
Unfortunately, Intel CPUs before Cascade Lake don't have this instruction, so it has to be emulated using earlier extensions (for example AVX-512BW). So my question is: how can this emulation be made as efficient as possible?
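For reference, here is a common emulation sketch (my own assumption, not part of the question) using _mm512_maddubs_epi16 followed by _mm512_madd_epi16. Note that _mm512_maddubs_epi16 saturates its 16-bit intermediate sums, so this is not bit-exact with vpdpbusd when a pair of products overflows the int16 range; the helper name dpbusd_emu is made up for illustration:

    #include <immintrin.h>

    // Hypothetical emulation of vpdpbusd with AVX-512BW.
    // a: 64 unsigned 8-bit values, b: 64 signed 8-bit values, sum: 16 int32 accumulators.
    static inline __m512i dpbusd_emu(__m512i sum, __m512i a, __m512i b)
    {
        // u8 * s8 products, summed in pairs into (saturating) int16 lanes.
        __m512i sum16 = _mm512_maddubs_epi16(a, b);
        // Widen: multiply each int16 by 1 and sum adjacent pairs into int32 lanes.
        __m512i sum32 = _mm512_madd_epi16(sum16, _mm512_set1_epi16(1));
        return _mm512_add_epi32(sum, sum32);
    }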
An i8 * i8 version of the same trick is possible, using set1_epi16(-1) for the MSB sums, or both using the same constant but subtracting instead of adding. I updated my answer on How to implement an efficient _mm256_madd_epi8? with this cool split trick. – Graiae