avx2 Questions
3
Solved
Is there any data on AVX2 gather latency?
(for instance a _mm256_i32gather_ps instruction accessing a single cache line)
Parrott asked 22/7, 2013 at 14:18
2
Solved
I need to bit-scan-reverse an array of 16-bit words using LZCNT.
The throughput of LZCNT is one instruction per clock on the latest generation of Intel processors. The throughput on an AMD Ryzen seems to...
Milan asked 15/5, 2019 at 15:43
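For reference, a minimal scalar sketch of the relationship between LZCNT and bit scan reverse on 16-bit words. It assumes GCC or Clang (`__builtin_clz`) and a 32-bit `unsigned int`; the helper names `lzcnt16`, `bsr16` and `bsr16_array` are hypothetical and not taken from the thread. It says nothing about throughput on either vendor's hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Leading zeros of a 16-bit word; returns 16 for x == 0, as LZCNT does
   for a 16-bit operand. Assumes unsigned int is 32 bits wide. */
static inline int lzcnt16(uint16_t x)
{
    return x ? __builtin_clz((unsigned)x) - 16 : 16;
}

/* Index of the highest set bit, or -1 when x == 0. */
static inline int bsr16(uint16_t x)
{
    return 15 - lzcnt16(x);
}

/* Applies bsr16 to every word of an array. */
void bsr16_array(const uint16_t *in, int8_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (int8_t)bsr16(in[i]);
}
```

The explicit zero check matters because `__builtin_clz(0)` is undefined, whereas the LZCNT instruction itself is defined for a zero input.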
0
Are there processors on which VPMASKMOVD generates faults for the masked-out elements?
Going by the Intel Software Developer's Manual, the answer is plainly "no":
Faults occur only due t...
Eastereasterday asked 28/1 at 15:16
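To make the question concrete, here is a small sketch of the usual reason people rely on that fault-suppression rule: loading the tail of a short array with `_mm256_maskload_epi32` (which compiles to VPMASKMOVD). It assumes GCC or Clang for the `target` attribute; the function name `load_tail_i32` is hypothetical.

```c
#include <immintrin.h>
#include <stdint.h>

/* Loads the first n (0..8) 32-bit integers from src into out[0..n-1]
   and zeroes the remaining lanes. Masked-out lanes are not supposed to
   fault, per the manual wording quoted in the question. */
__attribute__((target("avx2")))
void load_tail_i32(const int32_t *src, int n, int32_t out[8])
{
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    /* Lane i is active (all bits set) when i < n. */
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane);
    __m256i v = _mm256_maskload_epi32((const int *)src, mask);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

Whether any real processor deviates from the documented behaviour is exactly what the question asks; the sketch only shows the documented usage.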
3
Solved
I will preface this by saying that I am a complete beginner at SIMD intrinsics.
Essentially, I have a CPU which supports AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would lik...
Cabasset asked 27/12, 2019 at 0:23
2
Solved
From here, I learned that the support of AVX doesn't imply the support of BMI1. So how about AVX2: Do all CPUs that support AVX2 also support BMI2? Further, does the support of AVX2 imply the suppo...
3
Pretty much what the title says: I need a way to shift/shuffle the positions of all elements in a 256-bit AVX register by N places. All I have found about this uses 32- or 64-bit values (__buil...
1
Solved
I've run the same binaries compiled with gcc-13 (https://godbolt.org/z/qq5WrE8qx) on Intel i3-N305 3.8GHz and AMD Ryzen 7 3800X 3.9GHz PCs. This code uses the VCL library (https://github.com/vectorclas...
Underwing asked 25/12, 2023 at 7:56
4
Solved
I have a row-wise array of floats (~20 cols x ~1M rows) from which I need to extract two columns at a time into two __m256 registers.
...a0.........b0......
...a1.........b1......
// ...
...a7.......
1
I see that in the AVX2 instruction set, Intel distinguishes the XOR operations for integer, double and float with different instructions. For integer there's "VPXORD", for double "VXORPD", and for float "VXO...
Halfmoon asked 5/3, 2019 at 18:32
4
Solved
I work with AVX2 and need to compute a 64-bit x 64-bit -> 128-bit widening multiplication and get the high 64-bit part as fast as possible. Since AVX2 has no such instruction, is it reasonable for m...
Plantain asked 26/6, 2015 at 9:13
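As background for this question, a scalar sketch of the arithmetic involved: the high 64 bits of a 64x64 product built from four 32x32 -> 64 partial products. This is the same decomposition a vectorized version would typically apply per lane (AVX2 offers a 32x32 -> 64 multiply, `_mm256_mul_epu32`), but the code below is plain C and is not the thread's answer. The function name `mulhi_u64` is hypothetical.

```c
#include <stdint.h>

/* High 64 bits of the 128-bit product a * b, computed from 32-bit halves. */
uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
    const uint64_t mask = 0xFFFFFFFFu;
    uint64_t a_lo = a & mask, a_hi = a >> 32;
    uint64_t b_lo = b & mask, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    /* Sum of everything that lands in bits 32..95; cannot overflow 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & mask) + lo_hi;

    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}
```

On x86-64 with GCC or Clang, a scalar build can also simply use `unsigned __int128`, which usually compiles to a single MUL; whether the vector route pays off is the open part of the question.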
2
Solved
I want to perform an arbitrary permutation of single bits, pairs of bits, and nibbles (4 bits) on a CPU register (xmm, ymm or zmm) of width 128, 256 or 512 bits; this should be as fast as possible....
1
Solved
Since AMD Zen 4 has only 256-bit-wide operations on vector data, the following diagram from chipsandcheese's Zen 4 article shows 6 FP pipelines (4 ALU and 2 memory):
Each FMA does 1 multiplication ...
Prieto asked 7/5, 2023 at 15:45
0
I want to convert a bool[64] into a uint64_t where each bit represents the value of an element in the input array.
On modern x86 processors, this can be done quite efficiently, e.g. using vptestmd ...
Claypool asked 6/1, 2023 at 12:21
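The question mentions vptestmd, which is an AVX-512 instruction. For comparison, here is a hedged AVX2-only sketch of the same conversion using a byte compare and `_mm256_movemask_epi8`. It assumes `sizeof(bool) == 1`, GCC or Clang for the `target` attribute, and a hypothetical function name `pack_bools_avx2`.

```c
#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>

/* Packs 64 bools into a uint64_t; bit i is set when flags[i] is true.
   Assumes each bool occupies exactly one byte. */
__attribute__((target("avx2")))
uint64_t pack_bools_avx2(const bool *flags)
{
    const __m256i zero = _mm256_setzero_si256();
    __m256i lo = _mm256_loadu_si256((const __m256i *)flags);
    __m256i hi = _mm256_loadu_si256((const __m256i *)(flags + 32));

    /* One mask bit per byte that equals zero, i.e. per false element. */
    uint32_t lo_false = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(lo, zero));
    uint32_t hi_false = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(hi, zero));

    /* Invert so that set bits mark true elements. */
    return ~((uint64_t)lo_false | ((uint64_t)hi_false << 32));
}
```

Comparing against zero and inverting treats any non-zero byte as true, so the result does not depend on the bytes being exactly 0 or 1.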
1
Solved
Is there a relatively cheap way to extract the four edges (rows 0 and 15, and columns 0 and 15) of a 16x16 bitmatrix stored in a __m256i into four 16b lanes of a __m256i? I don't care which lanes t...
Urethrectomy asked 31/12, 2022 at 3:42
3
Solved
Is there a way to get the sum of the values stored in a __m256d variable? I have this code.
acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec));
// acc at this point contains {2.0, 8.0, 18.0, 32.0}
acc = _mm25...
Hendecahedron asked 20/4, 2018 at 12:27
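One common way to do this reduction is sketched below: add the upper 128-bit half to the lower half, then add the two remaining elements. It is an illustration under stated assumptions, not the accepted answer: the `target` attribute is specific to GCC and Clang, and the helper `hsum4_pd` (hypothetical name) takes a pointer rather than a `__m256d` so it can be called from code compiled without AVX enabled.

```c
#include <immintrin.h>

/* Returns p[0] + p[1] + p[2] + p[3] using AVX. */
__attribute__((target("avx")))
double hsum4_pd(const double *p)
{
    __m256d v    = _mm256_loadu_pd(p);
    __m128d lo   = _mm256_castpd256_pd128(v);     /* {v0, v1} */
    __m128d hi   = _mm256_extractf128_pd(v, 1);   /* {v2, v3} */
    __m128d pair = _mm_add_pd(lo, hi);            /* {v0+v2, v1+v3} */
    __m128d high = _mm_unpackhi_pd(pair, pair);   /* {v1+v3, v1+v3} */
    return _mm_cvtsd_f64(_mm_add_sd(pair, high)); /* (v0+v2) + (v1+v3) */
}
```

With the values from the question, {2.0, 8.0, 18.0, 32.0}, this returns 60.0. Note that the additions are grouped as (v0+v2) + (v1+v3), which can round differently from a strict left-to-right sum for other inputs.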
3
TL;DR:
TensorFlow 1.15 crashes on my virtual machine when imported by Python (error message is Illegal instruction (core dumped)), most likely because AVX and AVX2 are disabled on it.
My hos...
Abiogenetic asked 18/1, 2021 at 18:56
2
Solved
I would like to take the result of an 8-bit vertical SIMD comparison between 256-bit vectors and pack the bits into the lowest byte of each 32-bit element for a vpshufb lookup on the lowest bytes. ...
5
Solved
I'm on SUSE Linux Enterprise 10/11 machines. I launch my regressions to a farm of machines running Intel processors. Some of my tests fail because my tools are built using a library which requires ...
4
Solved
I have a __m256d vector packed with four 64-bit floating-point values.
I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value;
My at...
Gage asked 20/3, 2012 at 21:48
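A minimal sketch of one way to compute such a horizontal maximum, following the usual pattern of reducing the upper half into the lower half and then the two remaining elements. It assumes GCC or Clang for the `target` attribute; the helper name `hmax4_pd` is hypothetical, and it takes a pointer so that no `__m256d` crosses a function boundary compiled without AVX.

```c
#include <immintrin.h>

/* Returns the largest of p[0..3] using AVX. */
__attribute__((target("avx")))
double hmax4_pd(const double *p)
{
    __m256d v    = _mm256_loadu_pd(p);
    __m128d lo   = _mm256_castpd256_pd128(v);     /* {v0, v1} */
    __m128d hi   = _mm256_extractf128_pd(v, 1);   /* {v2, v3} */
    __m128d pair = _mm_max_pd(lo, hi);            /* {max(v0,v2), max(v1,v3)} */
    __m128d high = _mm_unpackhi_pd(pair, pair);   /* upper element in both lanes */
    return _mm_cvtsd_f64(_mm_max_sd(pair, high));
}
```

The result for inputs containing NaN follows the operand-order rules of MAXPD and MAXSD, so this sketch should only be relied on for NaN-free data.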
1
Solved
In this question, it is confirmed that __builtin_cpu_supports("avx2") doesn't check for OS support. (Or at least, it didn't before GCC fixed the bug). From Intel docs, I know that in addi...
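For context, a hedged sketch of the check this question is about: CPUID feature bits alone are not enough, because the OS must also have enabled saving of the YMM state, which is reported through XCR0. The sketch assumes GCC or Clang on x86 (`<cpuid.h>` and inline assembly); the names `read_xcr0` and `avx2_usable_by_os` are hypothetical.

```c
#include <cpuid.h>
#include <stdint.h>

/* Reads extended control register 0. Only valid when CPUID reports OSXSAVE. */
static uint64_t read_xcr0(void)
{
    uint32_t eax, edx;
    __asm__ volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
    return ((uint64_t)edx << 32) | eax;
}

/* Returns 1 when the CPU supports AVX2 and the OS saves YMM state, else 0. */
int avx2_usable_by_os(void)
{
    unsigned eax, ebx, ecx, edx;
    const unsigned osxsave = 1u << 27;   /* CPUID.1:ECX bit 27 */
    const unsigned avx     = 1u << 28;   /* CPUID.1:ECX bit 28 */

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if ((ecx & (osxsave | avx)) != (osxsave | avx))
        return 0;
    /* XCR0 bit 1 = SSE state, bit 2 = AVX (YMM) state. */
    if ((read_xcr0() & 0x6) != 0x6)
        return 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 5) & 1;               /* CPUID.(EAX=7,ECX=0):EBX bit 5 = AVX2 */
}
```

The order matters: XGETBV is only executed after OSXSAVE has been confirmed, since the instruction would fault otherwise.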
3
Please tell me, I can't figure it out myself:
I have a __m128i SIMD vector whose 16 bytes contain the following values:
1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1
Is it possible to somehow transf...
5
Solved
I'm looking for an efficient (fast) approximation of the exponential function operating on AVX elements (single-precision floating point), namely __m256 _mm256_exp_ps( __m256 x ), without SVML.
R...
Tarpaulin asked 19/2, 2018 at 10:08
4
Consider 8 digit characters like 12345678 as a string. It can be converted to a number where every byte contains a digit like this:
const char* const str = "12345678";
const char* const b...
1
Solved
I want to achieve something like the result of strncmp, but less complicated.
I tried to read https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strcmp-avx2.S.html source code but I failed ...
2
Solved
I am trying to improve performance and have had some good experiences with SIMD. So far I have been using OMP, and I would like to improve my skills further using intrinsics.
In the following scenario, I failed ...
Charqui asked 25/1, 2022 at 18:15