avx512 Questions

1

Solved

I have populated a zmm register with an array of byte integers from 0-63. The numbers serve as indices into a matrix. Non-zero elements represent rows in the matrix that contain data. Not all rows ...
Oxazine asked 10/5, 2020 at 19:28

1

Solved

On CPUs with AVX-512 and BF16 support, you can use the 512-bit vector registers to store 32 16-bit floats. I have found intrinsics to convert FP32 values to BF16 values (for example: _mm512_cvtne2...
Elementary asked 2/5, 2024 at 13:42
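
As background for the question above: a BF16 value has the same layout as the upper 16 bits of an FP32 value. The following is a minimal scalar sketch of that relationship, not the intrinsic itself. The helper names `fp32_to_bf16` and `bf16_to_fp32` are hypothetical, and the NaN handling is simplified, so it is not guaranteed to be bit-exact with the hardware conversion in every corner case (denormals, for instance).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: narrow FP32 to BF16 using round-to-nearest-even. */
static uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);              /* type-pun without UB */
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)      /* NaN: force a quiet NaN */
        return (uint16_t)((bits >> 16) | 0x0040u);
    /* Add 0x7FFF plus the lowest kept bit so that ties round to even. */
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}

/* Hypothetical helper: widen BF16 to FP32 by restoring the low 16 zero bits. */
static float bf16_to_fp32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Widening is exact because every BF16 value is representable in FP32; only the narrowing direction needs a rounding decision.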

1

Solved

I've tried to write a few functions to carry out matrix-vector multiplication using a single matrix together with an array of source vectors. I've once written those functions in C++ and once in x8...
Ermeena asked 21/1, 2024 at 0:13

1

I see that in the AVX2 instruction set, Intel distinguishes the XOR operations on integers, doubles and floats with different instructions. For integers there's "VPXORD", for doubles "VXORPD", and for floats "VXO...
Halfmoon asked 5/3, 2019 at 18:32

1

Solved

My CPU is an AMD Ryzen 7 7840H, which supports the AVX-512 instruction set. When I run the .NET 8 program, the value of Vector512.IsHardwareAccelerated is true. But System.Numerics.Vector<T> is still ...
Complacence asked 19/11, 2023 at 4:40

4

Solved

One of the AVX-512 instruction set extensions is AVX-512 + GFNI, "Galois Field New Instructions". Galois theory is about field extensions. What does that have to do with processing vectorized int...
Amenity asked 1/12, 2019 at 10:39
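
To make the "Galois field" part concrete: GFNI's byte multiply works in GF(2^8), where bytes are treated as polynomials over GF(2) and reduced modulo x^8 + x^4 + x^3 + x + 1, the same polynomial AES uses. Below is a scalar sketch of one such multiplication. The function name `gf256_mul` is hypothetical, and this is an illustration of the arithmetic rather than how the hardware implements it.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two elements of GF(2^8),
   reducing modulo x^8 + x^4 + x^3 + x + 1 (0x11B). */
static uint8_t gf256_mul(uint8_t a, uint8_t b) {
    uint8_t result = 0;
    while (b) {
        if (b & 1)
            result ^= a;               /* addition in GF(2) is XOR */
        uint8_t carry = a & 0x80;      /* will the shift overflow degree 7? */
        a = (uint8_t)(a << 1);         /* multiply a by x */
        if (carry)
            a ^= 0x1B;                 /* reduce by the low byte of 0x11B */
        b >>= 1;
    }
    return result;
}
```

For example, `gf256_mul(0x57, 0x83)` gives `0xC1`, the worked example from the AES specification.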

2

Solved

I wanted to explore auto-vectorization by gcc (10.3). I have the following short program (see https://godbolt.org/z/5v9a53aj6) which computes the sum of all elements of a vector: #include <stdio...
Tobacco asked 21/10, 2022 at 10:12

2

Solved

AVX-512 VNNI instructions, available on Intel CPUs since Cascade Lake, can accelerate inference of quantized neural networks on the CPU. In particular there is an instruction _mm512_dpbusd_epi32...
Irmairme asked 16/6, 2021 at 9:4
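
For readers unfamiliar with the operation, here is a scalar sketch of what `_mm512_dpbusd_epi32` computes for a single 32-bit lane: four unsigned-byte by signed-byte products are summed and added to the accumulator without saturation. The function name `dpbusd_lane` is hypothetical; the real instruction does this for 16 lanes at once.

```c
#include <stdint.h>

/* Hypothetical helper: one 32-bit lane of a dpbusd-style dot product.
   a holds unsigned bytes, b holds signed bytes. */
static int32_t dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    int32_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];   /* each product fits in 17 bits */
    /* Accumulate with wraparound (no saturation) via unsigned arithmetic. */
    return (int32_t)((uint32_t)acc + (uint32_t)sum);
}
```

Note the asymmetry: the first operand is unsigned and the second is signed, which is why quantized weights and activations often need an offset before use.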

0

I want to convert a bool[64] into a uint64_t where each bit represents the value of an element in the input array. On modern x86 processors, this can be done quite efficiently, e.g. using vptestmd ...
Claypool asked 6/1, 2023 at 12:21
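
A plain scalar version of the conversion described above is useful as a reference when checking a vectorized implementation. This sketch assumes element i of the array maps to bit i of the result; the function name `pack_bools` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reference: element i of the input becomes bit i of the result. */
static uint64_t pack_bools(const bool in[64]) {
    uint64_t out = 0;
    for (int i = 0; i < 64; ++i)
        out |= (uint64_t)in[i] << i;   /* bool converts to 0 or 1 */
    return out;
}
```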

1

Solved

The motivation for this question: The unaligned load is generally more common to use. The developer should use the aligned SIMD load when the address is already aligned. So I started to wonder if th...
Arianearianie asked 13/12, 2022 at 13:5

2

Solved

I would like to take the result of an 8-bit vertical SIMD comparison between 256-bit vectors and pack the bits into the lowest byte of each 32-bit element for a vpshufb lookup on the lowest bytes. ...
Ridgeway asked 20/10, 2022 at 4:11

1

Please consider the following minimal example minimal.cpp (https://godbolt.org/z/x7dYes91M). #include <immintrin.h> #include <algorithm> #include <ctime> #include <iostream>...
Skean asked 14/10, 2022 at 12:41

1

Can masking improve the performance of AVX-512 memory operations (load/store/gather/scatter and non-shuffling load-ops)? Seeing as masked out elements don't trigger memory faults, one would assume ...
Midmost asked 10/8, 2022 at 10:18

2

I explicitly use the Intel SIMD extension intrinsics in my C/C++ code. In order to compile the code I need to specify -mavx, or -mavx512, or something similar on the command line. I'm good with all...
Stempien asked 22/2, 2022 at 22:56

0

With an implicit loop-vectorization experiment, GCC 11.2 does not produce fma instructions but only packed add and packed multiply instructions: https://godbolt.org/z/srbfWMEG6 Sample code for test...
Inferno asked 17/4, 2022 at 15:2

1

I'm testing the memory bandwidth on a desktop and a server. Skylake desktop: 4 cores/8 hardware threads. Skylake server: Xeon 8168, dual socket, 48 cores (24 per socket) / 96 hardware threads. The pea...
Lil asked 28/6, 2019 at 9:5

3

Solved

It is known that GCC/Clang auto-vectorize loops well using SIMD instructions. It is also known that there exists the standard C++ alignas() specifier, which among other uses also allows one to align stack v...
Billups asked 20/11, 2021 at 12:9

3

Solved

Given a number in a register (a binary integer), how to convert it to a string of hexadecimal ASCII digits? (i.e. serialize it into a text format.) Digits can be stored in memory or printed on the...
Retentivity asked 17/12, 2018 at 22:14
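
As a baseline for the conversion asked about above, here is a simple scalar sketch that turns a 32-bit integer into eight lowercase hex digits, most significant nibble first. The function name `u32_to_hex` and the fixed-width, zero-padded output are assumptions made for illustration; SIMD versions do the same per-nibble table lookup in parallel.

```c
#include <stdint.h>

/* Hypothetical helper: write 8 hex digits plus a terminating NUL into out. */
static void u32_to_hex(uint32_t value, char out[9]) {
    static const char digits[] = "0123456789abcdef";
    for (int i = 0; i < 8; ++i) {
        /* Take nibbles from the top down so the string reads left to right. */
        out[i] = digits[(value >> (28 - 4 * i)) & 0xF];
    }
    out[8] = '\0';
}
```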

2

From the value we can infer that it uses the same components as double-precision floating-point hardware. But double has 53 bits of significand, so why is AVX512-IFMA limited to 52 bits? Sure the m...
Homeomorphism asked 4/3, 2015 at 18:23

1

Let's say you call _mm512_mask_store_ps. From the point of view of the CPU's write buffer, is it executed as a store of 64 bytes (with some sort of masking) or is it executed internally as mult...
Analytic asked 3/9, 2020 at 20:47

2

Solved

I'm trying to optimize some matrix computations and I was wondering if it was possible to detect at compile time whether SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI[1] is enabled by the compiler? Ideall...
Bowing asked 9/3, 2015 at 10:23
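
One common approach to this kind of compile-time detection is to test the macros the compiler predefines when an instruction set is enabled. The sketch below covers only a few of the sets named in the question, using macros that GCC and Clang define (`__SSE2__`, `__AVX__`, `__AVX2__`, `__AVX512F__`); the function name `simd_level` is hypothetical, and other sets would need their own macro checks.

```c
/* Hypothetical helper: report the widest SIMD level enabled at compile time. */
static const char *simd_level(void) {
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none detected";   /* not enabled, or not an x86 target */
#endif
}
```

The result reflects the flags the translation unit was compiled with (for example `-mavx2` or `-march=native`), not what the CPU running the program supports.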

1

Solved

I need to disable all AVX512 extensions in gcc-compiled code. The reason is that Valgrind chokes on AVX512 instructions. Is there a way to do it with a single flag? I know how to disable each ext...
Glindaglinka asked 23/3, 2020 at 14:17

1

Solved

The EVEX.z bit is used in AVX-512 in conjunction with the k registers to control masking. If the z bit is 0, it's merge-masking and if the z bit is 1 the zero elements in the k register are zeroed ...
Ixion asked 20/3, 2020 at 16:52
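
The difference between the two masking modes can be shown with a scalar emulation of an 8-lane masked add, roughly what `_mm256_mask_add_epi32` (merge) and `_mm256_maskz_add_epi32` (zero) do. The function name `masked_add8` and the `zeroing` flag standing in for EVEX.z are assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical emulation: lanes whose mask bit is set receive a[i] + b[i].
   Other lanes keep src[i] (merge-masking, z = 0) or become 0 (zero-masking, z = 1). */
static void masked_add8(int32_t out[8], const int32_t src[8], uint8_t k,
                        const int32_t a[8], const int32_t b[8], int zeroing) {
    for (int i = 0; i < 8; ++i) {
        if ((k >> i) & 1)
            out[i] = a[i] + b[i];
        else
            out[i] = zeroing ? 0 : src[i];
    }
}
```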

1

Solved

My goal is to create a PCIe transaction with more than 64b payload. For that I need to read an ioremap() address. For 128b and 256b I can use xmm and ymm registers respectively and that works as ...
Banister asked 16/3, 2020 at 3:15

1

Solved

I am looking for an optimal method to calculate the sum of all packed 32-bit integers in a __m256i or __m512i. To calculate the sum of n elements, I often use log2(n) vpaddd and vpermd instructions, then extra...
Botha asked 7/2, 2020 at 7:8
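
The log2(n) pattern mentioned above can be illustrated without intrinsics: each step adds the upper half of the remaining lanes onto the lower half, so 16 lanes need 4 steps. This scalar sketch uses a hypothetical function `hsum16` on a plain array; in a real SIMD version each inner loop corresponds to one shuffle or extract plus one vector add.

```c
#include <stdint.h>

/* Hypothetical emulation of a 16-lane horizontal sum by repeated halving. */
static int32_t hsum16(const int32_t v[16]) {
    int32_t lanes[16];
    for (int i = 0; i < 16; ++i)
        lanes[i] = v[i];
    /* width = 8, 4, 2, 1: four steps, i.e. log2(16). */
    for (int width = 8; width >= 1; width /= 2)
        for (int i = 0; i < width; ++i)
            lanes[i] += lanes[i + width];
    return lanes[0];
}
```

Compilers that support AVX-512 also provide `_mm512_reduce_add_epi32`, which expands to a similar sequence.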

© 2022 - 2025 — McMap. All rights reserved.