avx2 - McMap

3

Solved

Is there any data on the latency of an AVX2 gather instruction?

Is there any data on AVX2 gather latency? (for instance a _mm256_i32gather_ps instruction accessing a single cache line)

performance x86 latency micro-optimization avx2

Parrott asked 22/7, 2013 at 14:18

2

Solved

Can AVX2 be used to implement faster LZCNT processing on a word array?

I need to bit-scan-reverse an array of 16-bit words with LZCNT. The throughput of LZCNT is 1 execution per clock on Intel's latest-generation processors. The throughput on an AMD Ryzen seems to...

x86 simd avx micro-optimization avx2

Milan asked 15/5, 2019 at 15:43

0

Are there processors on which VPMASKMOVD generates faults for the masked-out elements?

Are there processors on which VPMASKMOVD generates faults for the masked-out elements? Going by the Intel Software Developer's Manual, the answer is plainly "no": Faults occur only due t...

assembly x86 avx2 amd-processor

Eastereasterday asked 28/1 at 15:16

3

Solved

AVX2: Computing dot product of 512 float arrays

I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would lik...

c++ simd avx2 dot-product fma

Cabasset asked 27/12, 2019 at 0:23

2

Solved

Do all CPUs that support AVX2 also support BMI2 or popcnt?

From here, I learned that the support of AVX doesn't imply the support of BMI1. So how about AVX2: Do all CPUs that support AVX2 also support BMI2? Further, does the support of AVX2 imply the suppo...

assembly x86-64 avx2 bmi

Palermo asked 8/6, 2023 at 1:33

3

Is there a way to shuffle an 8-bit x 32 ymm register right/left by N positions (C++)?

Pretty much what the title says: I need a way to shift/shuffle the positions of all elements in a 256-bit AVX register by N places. All I have found about this uses 32- or 64-bit values (__buil...

c++ simd clang++ avx2

Wolff asked 12/2, 2021 at 22:17

1

Solved

Why is performance for this index-of-max function over many arrays of 256 bytes so slow on Intel i3-N305 compared to AMD Ryzen 7 3800X?

I've run the same binaries compiled with gcc-13 (https://godbolt.org/z/qq5WrE8qx) on Intel i3-N305 3.8GHz and AMD Ryzen 7 3800X 3.9GHz PCs. This code uses the VCL library (https://github.com/vectorclas...

c++ benchmarking simd avx2 vector-class-library

Underwing asked 25/12, 2023 at 7:56

4

Solved

Packing and de-interleaving two __m256 registers

I have a row-wise array of floats (~20 cols x ~1M rows) from which I need to extract two columns at a time into two __m256 registers. ...a0.........b0...... ...a1.........b1...... // ... ...a7.......

c++ x86 simd avx avx2

Marolda asked 27/2, 2017 at 23:58

1

What's the difference between the XOR instructions "VPXORD", "VXORPS" and "VXORPD" in Intel's AVX2?

I see that in the AVX2 instruction set, Intel distinguishes the XOR operations for integer, double and float with different instructions. For integer there's "VPXORD", and for double "VXORPD", for float "VXO...

x86 cpu-architecture avx avx2 avx512

Halfmoon asked 5/3, 2019 at 18:32

4

Solved

Is it really efficient to use the Karatsuba algorithm in 64-bit x 64-bit multiplication?

I work with AVX2 and need to compute a 64-bit x 64-bit -> 128-bit widening multiplication and get the 64-bit high part as fast as possible. Since AVX2 has no such instruction, is it reasonable for m...

c++ performance parallel-processing simd avx2

Plantain asked 26/6, 2015 at 9:13

2

Solved

What's the fastest way to perform an arbitrary 128/256/512 bit permutation using SIMD instructions?

I want to perform an arbitrary permutation of single bits, pairs of bits, and nibbles (4 bits) on a CPU register (xmm, ymm or zmm) of width 128, 256 or 512 bits; this should be as fast as possible....

c++ assembly sse avx avx2

Grasmere asked 28/1, 2019 at 19:09

1

Solved

Does Zen 4 core have 48 flops per cycle for 32-bit precision fp?

Since AMD Zen 4 has only 256-bit-wide operations on vector data, the following diagram from chipsandcheese's Zen 4 article shows 6 FP pipelines (4 ALU and 2 memory): Each FMA does 1 multiplication ...

performance x86-64 cpu-architecture avx2 amd-processor

Prieto asked 7/5, 2023 at 15:45

0

Clang: autovectorize conversion of bool[64] array to uint64_t bit mask

I want to convert a bool[64] into a uint64_t where each bit represents the value of an element in the input array. On modern x86 processors, this can be done quite efficiently, e.g. using vptestmd ...

c++ clang compiler-optimization avx2 avx512

Claypool asked 6/1, 2023 at 12:21
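
For reference, the conversion this question describes can be written as a plain scalar loop. This is only a minimal sketch and not the asker's code: the function name `pack_bools` is hypothetical, and it assumes element i maps to bit i, which the excerpt does not state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: packs 64 bools into one 64-bit mask.
   Assumption: in[i] becomes bit i (bit 0 is the least significant bit). */
uint64_t pack_bools(const bool in[64])
{
    uint64_t mask = 0;
    for (int i = 0; i < 64; ++i) {
        /* A bool converts to 0 or 1, so the shifted value sets at most bit i. */
        mask |= (uint64_t)in[i] << i;
    }
    return mask;
}
```

A loop of this shape is the kind of baseline a compiler would be asked to autovectorize; whether a given compiler actually does so is the subject of the question.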

1

Solved

Extracting edges of AVX2 16x16 bitmatrix

Is there a relatively cheap way to extract the four edges (rows 0 and 15, and columns 0 and 15) of a 16x16 bitmatrix stored in a __m256i into four 16b lanes of a __m256i? I don't care which lanes t...

c bit-manipulation intrinsics avx2

Urethrectomy asked 31/12, 2022 at 3:42

3

Solved

Get sum of values stored in __m256d with SSE/AVX

Is there a way to get the sum of the values stored in a __m256d variable? I have this code. acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec)); //acc in this point contains {2.0, 8.0, 18.0, 32.0} acc = _mm25...

c++ optimization sse avx avx2

Hendecahedron asked 20/4, 2018 at 12:27

3

How to enable AVX / AVX2 in VirtualBox 6.1.16 with Ubuntu 20.04 64bit?

TL;DR: Tensorflow 1.15 crashes on my virtual machine when imported by Python (error message is Illegal instruction (core dumped)), very probably thanks to AVX and AVX2 being disabled on it. My hos...

tensorflow virtual-machine virtualbox avx avx2

Abiogenetic asked 18/1, 2021 at 18:56

2

Solved

x86 SIMD – packing 8-bit compare results into 32-bit entries

I would like to take the result of an 8-bit vertical SIMD comparison between 256-bit vectors and pack the bits into the lowest byte of each 32-bit element for a vpshufb lookup on the lowest bytes. ...

c x86 avx2 avx512

Ridgeway asked 20/10, 2022 at 4:11

5

Solved

How to tell if a Linux machine supports AVX/AVX2 instructions?

I'm on SUSE Linux Enterprise 10/11 machines. I launch my regressions to a farm of machines running Intel processors. Some of my tests fail because my tools are built using a library which requires ...

linux unix avx suse avx2

Kanchenjunga asked 27/5, 2016 at 9:40
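
On x86 Linux, one common way to check is to look for the `avx2` token on the `flags` line of `/proc/cpuinfo`. The sketch below assumes that file layout; the helper names `line_has_flag` and `cpuinfo_has_flag` are hypothetical and are not taken from the question or its answers.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: true if `flag` appears as a whole token in `line`.
   Whole-token matching matters: "avx" must not match inside "avx2". */
bool line_has_flag(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    if (n == 0)
        return false;
    const char *p = line;
    while ((p = strstr(p, flag)) != NULL) {
        bool start_ok = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        char end = p[n];
        bool end_ok = end == '\0' || end == ' ' || end == '\t' || end == '\n';
        if (start_ok && end_ok)
            return true;
        p += 1; /* keep scanning after a partial match */
    }
    return false;
}

/* Hypothetical helper: scans a cpuinfo-style file for `flag` on a "flags" line.
   Assumption: `path` is normally "/proc/cpuinfo". Returns false if the file
   cannot be opened. */
bool cpuinfo_has_flag(const char *path, const char *flag)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return false;
    char buf[8192];
    bool found = false;
    while (!found && fgets(buf, sizeof buf, f) != NULL) {
        if (strncmp(buf, "flags", 5) == 0)
            found = line_has_flag(buf, flag);
    }
    fclose(f);
    return found;
}
```

Calling `cpuinfo_has_flag("/proc/cpuinfo", "avx2")` on each machine in the farm would give a yes/no answer per host, under the assumptions above.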

4

Solved

How to find the horizontal maximum in a 256-bit AVX vector

I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My at...

x86 simd avx vector-processing avx2

Gage asked 20/3, 2012 at 21:48

1

Solved

Are the xgetbv and CPUID checks sufficient to guarantee AVX2 support?

In this question, it is confirmed that __builtin_cpu_supports("avx2") doesn't check for OS support. (Or at least, it didn't before GCC fixed the bug). From Intel docs, I know that in addi...

x86 x86-64 simd avx2

Brachypterous asked 6/6, 2022 at 19:59

3

How to create a left-packed vector of indices of the 0s in one SIMD vector?

Please tell me, I can't figure it out myself: I have an __m128i SIMD vector whose 16 bytes contain the following values: 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1 Is it possible to somehow transf...

c++ c simd avx2

Fredericton asked 3/5, 2022 at 10:47

5

Solved

Fastest Implementation of Exponential Function Using AVX

I'm looking for an efficient (Fast) approximation of the exponential function operating on AVX elements (Single Precision Floating Point). Namely - __m256 _mm256_exp_ps( __m256 x ) without SVML. R...

x86 simd avx exponential avx2

Tarpaulin asked 19/2, 2018 at 10:08

4

Is there a fast way to convert a string of 8 ASCII decimal digits into a binary number?

Consider 8 digit characters like 12345678 as a string. It can be converted to a number where every byte contains a digit like this: const char* const str = "12345678"; const char* const b...

c++ parsing simd avx2 atoi

Krug asked 22/3, 2022 at 11:00
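
One widely used scalar (SWAR) approach to this problem combines adjacent digits with three multiply-and-shift steps. The sketch below is an illustration and not necessarily what the asker or the answers use: the name `parse8` is hypothetical, it assumes a little-endian machine, and it does not validate that the input bytes are digits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: converts exactly 8 ASCII decimal digits to an integer.
   Assumptions: little-endian byte order; all 8 bytes are '0'..'9'. */
uint32_t parse8(const char *s)
{
    uint64_t v;
    memcpy(&v, s, 8); /* first character lands in the lowest byte */

    /* Strip the ASCII high nibble, then merge digit pairs: 10*d0 + d1 per 16-bit lane. */
    v = ((v & 0x0F0F0F0F0F0F0F0FULL) * 2561ULL) >> 8;
    /* Merge 2-digit pairs into 4-digit values: 100*p0 + p1 per 32-bit lane. */
    v = ((v & 0x00FF00FF00FF00FFULL) * 6553601ULL) >> 16;
    /* Merge the two 4-digit halves: 10000*q0 + q1. */
    v = ((v & 0x0000FFFF0000FFFFULL) * 42949672960001ULL) >> 32;

    return (uint32_t)v;
}
```

The multipliers are 10 * 2^8 + 1, 100 * 2^16 + 1 and 10000 * 2^32 + 1; each multiplication adds a lane to its scaled lower neighbour, and the mask in the next step discards the lanes that hold unwanted sums.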

1

Solved

How to compare two vectors using SIMD and get a strncmp like result?

I want to achieve something like a strncmp result, but not that complicated. I tried to read the https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strcmp-avx2.S.html source code but I failed ...

c simd avx avx2

Brooking asked 8/2, 2022 at 14:18

2

Solved

Solve loop data dependency with SIMD - finding transitions between -1 and +1 in an int8_t array of sgn values

I am trying to improve performance and have had some good experiences with SIMD. So far I have been using OMP and would like to improve my skills further using intrinsics. In the following scenario, I failed ...

c++ performance optimization simd avx2

Charqui asked 25/1, 2022 at 18:15
