avx Questions

1

Solved

I have populated a zmm register with an array of byte integers from 0-63. The numbers serve as indices into a matrix. Non-zero elements represent rows in the matrix that contain data. Not all rows ...
Oxazine asked 10/5, 2020 at 19:28

4

Solved

Is there a way to XOR horizontally an AVX register—specifically, to XOR the four 64-bit components of a 256-bit register? The goal is to get the XOR of all 4 64-bit components of an AVX register. ...
Oviposit asked 5/7, 2017 at 21:00
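One common way to approach the horizontal XOR asked about above is to fold the 256-bit register in halves. This is only a minimal sketch, not the accepted answer: it assumes GCC or Clang on x86-64, and the names `hxor64_avx` and `hxor64` are made up for illustration.

```cpp
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC or Clang on x86-64. The target attribute enables AVX for
// this one function, so the file builds without -mavx.
__attribute__((target("avx")))
static std::uint64_t hxor64_avx(const std::uint64_t v[4]) {
    __m256i x  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(v));
    __m128i lo = _mm256_castsi256_si128(x);          // elements 0 and 1
    __m128i hi = _mm256_extractf128_si256(x, 1);     // elements 2 and 3
    __m128i s  = _mm_xor_si128(lo, hi);              // {v0^v2, v1^v3}
    s = _mm_xor_si128(s, _mm_unpackhi_epi64(s, s));  // low element = v0^v1^v2^v3
    return static_cast<std::uint64_t>(_mm_cvtsi128_si64(s));
}

// Hypothetical wrapper: take the AVX path only if the CPU reports AVX,
// otherwise fall back to plain scalar XOR.
std::uint64_t hxor64(const std::uint64_t v[4]) {
    if (__builtin_cpu_supports("avx")) return hxor64_avx(v);
    return v[0] ^ v[1] ^ v[2] ^ v[3];
}
```

The fold needs one 128-bit extract and two XORs, and the result ends up in the low 64 bits of an xmm register.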

1

Solved

std::replace implementation can be optimized using vectorization (by specializing the library implementation or by the compiler). The vectorized implementation would compare and replace several ele...
Danille asked 2/3, 2024 at 10:39

4

I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column. For instance for n=4: 1101 0101 0001 1001 M stored as { { 1,1,0,1}, {0...
Laaspere asked 23/7, 2018 at 9:37

2

Solved

I need to bit-scan-reverse an array of 16-bit words with LZCNT. The throughput of LZCNT is 1 execution per clock on Intel's latest-generation processors. The throughput on an AMD Ryzen seems to...
Milan asked 15/5, 2019 at 15:43

4

Solved

uint8_t data[] = "mykeyxyz:1234\nky:123\n...";. Each line of my string has the format key:value, where len(key) <= 16 is guaranteed. I want to load mykeyxyz into a __m128i, but fill...
Laird asked 11/1, 2024 at 14:59

1

There are similar older questions, but they use intrinsics and old instruction sets. I have a function f written with the C++ vector class library (https://www.agner.org/optimize/#vectorclass): i...
Cocainism asked 23/12, 2023 at 9:46

4

Solved

I have a row-wise array of floats (~20 cols x ~1M rows) from which I need to extract two columns at a time into two __m256 registers. ...a0.........b0...... ...a1.........b1...... // ... ...a7.......
Marolda asked 27/2, 2017 at 23:58

1

Solved

I am learning about C avx intrinsics and I am wondering how this works. I am familiar that I can do something like this: __m256 evens = _mm256_set_ps(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0); H...
Paige asked 29/11, 2023 at 20:50

1

I see that in the AVX2 instruction set, Intel distinguishes the XOR operations of integer, double and float with different instructions. For integer there's "VPXORD", and for double "VXORPD", for float "VXO...
Halfmoon asked 5/3, 2019 at 18:32

4

Solved

SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers. _mm_cvtps_epi32() _mm_cvtepi32_ps() But there are no equivalents for double-precision and 64-bi...
Apotheosize asked 14/12, 2016 at 14:09

0

I have a curious problem porting working numerical code from Intel 2023 & MSC Visual C++ 2022. The code compiled with GCC is perfectly accurate (too accurate) since some library calls are worki...
Circumpolar asked 25/10, 2023 at 16:36

2

Solved

Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, it does this: i...
Womenfolk asked 28/7, 2017 at 14:36
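To make the interleaving done by the two unpack operations concrete, here is a small sketch for 32-bit elements. It assumes an x86-64 target, where SSE2 is always available, and `unpack_demo` is a hypothetical helper name.

```cpp
#include <array>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Hypothetical helper: returns the unpacklo result followed by the
// unpackhi result, so both can be inspected as one 8-element array.
std::array<std::int32_t, 8> unpack_demo(const std::int32_t a[4],
                                        const std::int32_t b[4]) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    std::array<std::int32_t, 8> out{};
    // unpacklo interleaves the low halves:  a0 b0 a1 b1
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()),
                     _mm_unpacklo_epi32(va, vb));
    // unpackhi interleaves the high halves: a2 b2 a3 b3
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data() + 4),
                     _mm_unpackhi_epi32(va, vb));
    return out;
}
```

With a = {0, 1, 2, 3} and b = {10, 11, 12, 13}, the combined output is {0, 10, 1, 11, 2, 12, 3, 13}.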

3

Solved

Consider the following float loop, compiled using -O3 -mavx2 -mfma: for (auto i = 0; i < a.size(); ++i) { a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0; } Clang did a perfect job of vectorizing it. It...
Desiccant asked 13/7, 2023 at 23:17

5

Solved

Using MSVC 2013 and AVX 1, I've got 8 floats in a register: __m256 foo = _mm256_fmadd_ps(a,b,c); Now I want to call inline void print(float) {...} for all 8 floats. It looks like the Intel AVX in...
Anticlimax asked 3/6, 2016 at 10:51

2

Solved

I wanted to explore auto-vectorization by gcc (10.3). I have the following short program (see https://godbolt.org/z/5v9a53aj6) which computes the sum of all elements of a vector: #include <stdio...
Tobacco asked 21/10, 2022 at 10:12

2

Solved

I want to perform an arbitrary permutation of single bits, pairs of bits, and nibbles (4 bits) on a CPU register (xmm, ymm or zmm) of width 128, 256 or 512 bits; this should be as fast as possible....
Grasmere asked 28/1, 2019 at 19:09

1

Solved

I was tasked with implementing an optimised matrix multiplication micro-kernel that computes C = A*B in C++ starting from the following snippet of code. I am getting some counter intuitive behaviou...
Owing asked 4/3, 2023 at 17:52

1

Introduction of the problem: I am trying to speed up the intersection code of a (2d) ray tracer that I am writing. I am using C# and the System.Numerics library to bring the speed of SIMD instructio...
Hogan asked 9/7, 2019 at 11:42

3

Solved

I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the time. I have been using t...
Garrotte asked 7/3, 2014 at 17:17

3

Solved

Is there a way to get sum of values stored in __m256d variable? I have this code. acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec)); //acc in this point contains {2.0, 8.0, 18.0, 32.0} acc = _mm25...
Hendecahedron asked 20/4, 2018 at 12:27
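For the horizontal sum of a `__m256d` asked about above, one typical pattern is to add the upper 128-bit half onto the lower half and then add the two remaining doubles. The sketch below assumes GCC or Clang on x86-64; `hsum_pd_avx` and `hsum_pd` are illustrative names, not taken from the question.

```cpp
#include <immintrin.h>

// Assumption: GCC or Clang on x86-64. The target attribute enables AVX for
// this one function, so the file builds without -mavx.
__attribute__((target("avx")))
static double hsum_pd_avx(const double v[4]) {
    __m256d x  = _mm256_loadu_pd(v);
    __m128d lo = _mm256_castpd256_pd128(x);    // {v0, v1}
    __m128d hi = _mm256_extractf128_pd(x, 1);  // {v2, v3}
    __m128d s  = _mm_add_pd(lo, hi);           // {v0+v2, v1+v3}
    __m128d sw = _mm_unpackhi_pd(s, s);        // {v1+v3, v1+v3}
    return _mm_cvtsd_f64(_mm_add_sd(s, sw));   // (v0+v2) + (v1+v3)
}

// Hypothetical wrapper: scalar fallback (same summation order) if the CPU
// does not report AVX.
double hsum_pd(const double v[4]) {
    if (__builtin_cpu_supports("avx")) return hsum_pd_avx(v);
    return (v[0] + v[2]) + (v[1] + v[3]);
}
```

For the values quoted in the question, {2.0, 8.0, 18.0, 32.0}, this returns 60.0. The summation order differs from a left-to-right scalar loop, so results can differ in the last bits for inputs that are not exactly representable.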

3

TL;DR: Tensorflow 1.15 crashes on my virtual machine when imported by Python (error message is Illegal instruction (core dumped)), very probably thanks to AVX and AVX2 being disabled on it. My hos...
Abiogenetic asked 18/1, 2021 at 18:56

7

Solved

Is the following code valid to check if a CPU supports the SSE3 instruction set? Using the IsProcessorFeaturePresent() function apparently does not work on Windows XP. bool CheckSSE3() { int CPUIn...
Yuk asked 25/5, 2011 at 8:49
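The question's own code is cut off above, so the following is not a reconstruction of it. It is a separate minimal sketch of the same idea for GCC or Clang, using `__get_cpuid` from `<cpuid.h>` (MSVC would use `__cpuid` from `<intrin.h>` instead). The function name `check_sse3` is an assumption.

```cpp
#include <cpuid.h>  // GCC/Clang only

// Returns true if CPUID reports SSE3 support.
bool check_sse3() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // __get_cpuid returns 0 if the requested leaf is not supported.
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    return (ecx & 1u) != 0;  // CPUID leaf 1, ECX bit 0 = SSE3
}
```

SSE3 needs no operating-system state beyond what SSE already requires, so the CPUID bit alone is enough here. Checking for AVX additionally requires confirming OS support for the wider registers.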

5

Solved

I'm on SUSE Linux Enterprise 10/11 machines. I launch my regressions to a farm of machines running Intel processors. Some of my tests fail because my tools are built using a library which requires ...
Kanchenjunga asked 27/5, 2016 at 9:40

1

Solved

I am currently learning how to work with SIMD intrinsics. I know that an AVX 256-bit vector can contain four doubles, eight floats, or eight 32-bit integers. How do we use AVX to process arrays tha...
Meingoldas asked 16/9, 2022 at 3:18
