avx Questions
1
Solved
I have populated a zmm register with an array of byte integers from 0-63. The numbers serve as indices into a matrix. Non-zero elements represent rows in the matrix that contain data. Not all rows ...
1
Solved
std::replace implementation can be optimized using vectorization (by specializing the library implementation or by the compiler).
The vectorized implementation would compare and replace several ele...
Danille asked 2/3, 2024 at 10:39
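One way to picture the vectorization idea in the excerpt above is a branch-free rewrite of the loop. This is only a sketch, not the standard library's implementation: `replace_branchless` is a hypothetical name, and the `std::int32_t` element type is an assumption. Every iteration stores a selected value, so the loop body has no data-dependent branch, which is the shape a compiler can map to a vector compare followed by a blend.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not std::replace itself): replaces every element equal
// to old_value with new_value, using a select instead of a conditional store.
void replace_branchless(std::int32_t* data, std::size_t n,
                        std::int32_t old_value, std::int32_t new_value) {
    for (std::size_t i = 0; i < n; ++i) {
        // The store happens on every iteration; only the stored value varies.
        // This keeps the loop free of control flow that depends on the data.
        data[i] = (data[i] == old_value) ? new_value : data[i];
    }
}
```

Whether a given compiler actually vectorizes this loop depends on the optimization level and target flags, so the generated assembly is worth checking.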
4
I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column.
For instance for n=4:
1101
0101
0001
1001
M stored as { { 1,1,0,1}, {0...
Laaspere asked 23/7, 2018 at 9:37
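As a scalar baseline for the column counts described above, a minimal sketch could look like the following. The storage layout is an assumption, since the excerpt is truncated: here the matrix is taken to be a flat row-major buffer with one byte (0 or 1) per element, and `column_counts` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: counts the ones in each column of an n x n 0/1 matrix
// stored row by row with one byte per element (an assumed layout).
std::vector<std::uint32_t> column_counts(const std::vector<std::uint8_t>& m,
                                         std::size_t n) {
    std::vector<std::uint32_t> counts(n, 0);
    for (std::size_t r = 0; r < n; ++r) {
        const std::uint8_t* row = m.data() + r * n;
        // The inner loop walks contiguous memory without branches, which
        // gives the compiler a chance to auto-vectorize the accumulation.
        for (std::size_t c = 0; c < n; ++c) {
            counts[c] += row[c];
        }
    }
    return counts;
}
```

For the n=4 matrix in the excerpt (rows 1101, 0101, 0001, 1001), this yields the column counts 2, 2, 0, 4.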
2
Solved
I need to bit scan reverse with LZCNT an array of words: 16 bits.
The throughput of LZCNT is 1 execution per clock on Intel's latest-generation processors. The throughput on an AMD Ryzen seems to...
Milan asked 15/5, 2019 at 15:43
4
Solved
uint8_t data[] = "mykeyxyz:1234\nky:123\n...";
Each line of my string has the format key:value, where len(key) <= 16 is guaranteed for every line. I want to load mykeyxyz into a __m128i, but fill...
Laird asked 11/1, 2024 at 14:59
1
There are similar older questions, but they use intrinsics and old instruction sets. I have a function f written with the C++ vector class library (https://www.agner.org/optimize/#vectorclass):
i...
Cocainism asked 23/12, 2023 at 9:46
4
Solved
I have a row-wise array of floats (~20 cols x ~1M rows) from which I need to extract two columns at a time into two __m256 registers.
...a0.........b0......
...a1.........b1......
// ...
...a7.......
1
Solved
I am learning about C AVX intrinsics and I am wondering how this works.
I know that I can do something like this:
__m256 evens = _mm256_set_ps(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0);
H...
1
I see that in the AVX2 instruction set, Intel distinguishes the XOR operations on integer, double and float with different instructions. For integer there's "VPXORD", for double "VXORPD", and for float "VXO...
Halfmoon asked 5/3, 2019 at 18:32
4
Solved
SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers.
_mm_cvtps_epi32()
_mm_cvtepi32_ps()
But there are no equivalents for double-precision and 64-bi...
Apotheosize asked 14/12, 2016 at 14:9
0
I have a curious problem porting working numerical code from Intel 2023 & MSC Visual C++ 2022.
The code compiled with GCC is perfectly accurate (too accurate) since some library calls are worki...
Circumpolar asked 25/10, 2023 at 16:36
2
Solved
Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, it does this:
i...
3
Solved
Consider the following float loop, compiled using -O3 -mavx2 -mfma:
for (auto i = 0; i < a.size(); ++i) {
a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}
Clang did a perfect job of vectorizing it. It...
Desiccant asked 13/7, 2023 at 23:17
5
Solved
Using MSVC 2013 and AVX 1, I've got 8 floats in a register:
__m256 foo = _mm256_fmadd_ps(a,b,c);
Now I want to call inline void print(float) {...} for all 8 floats. It looks like the Intel AVX in...
Anticlimax asked 3/6, 2016 at 10:51
2
Solved
I wanted to explore auto-vectorization by gcc (10.3). I have the following short program (see https://godbolt.org/z/5v9a53aj6) which computes the sum of all elements of a vector:
#include <stdio...
Tobacco asked 21/10, 2022 at 10:12
2
Solved
I want to perform an arbitrary permutation of single bits, pairs of bits, and nibbles (4 bits) on a CPU register (xmm, ymm or zmm) of width 128, 256 or 512 bits; this should be as fast as possible....
1
Solved
I was tasked with implementing an optimised matrix multiplication micro-kernel that computes C = A*B in C++, starting from the following snippet of code. I am getting some counterintuitive behaviou...
Owing asked 4/3, 2023 at 17:52
1
Introduction of the problem
I am trying to speed up the intersection code of a (2d) ray tracer that I am writing. I am using C# and the System.Numerics library to bring the speed of SIMD instructio...
Hogan asked 9/7, 2019 at 11:42
3
Solved
Is there a way to get sum of values stored in __m256d variable? I have this code.
acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec));
//acc in this point contains {2.0, 8.0, 18.0, 32.0}
acc = _mm25...
Hendecahedron asked 20/4, 2018 at 12:27
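A common way to reduce a __m256d to a single double is to add the upper 128-bit half to the lower half and then add the two remaining lanes. The sketch below illustrates that pattern under a few assumptions: `hsum4_pd` and the `HSUM_TARGET_AVX` macro are hypothetical names, the function attribute is GCC/Clang-specific (other compilers are assumed to enable AVX through their own flags), and the CPU running the code must support AVX. The helper takes a pointer to four doubles so that callers built without AVX flags can still use it.

```cpp
#include <immintrin.h>

// Assumption: GCC or Clang, where a per-function target attribute enables AVX
// without requiring -mavx for the whole translation unit.
#if defined(__GNUC__) || defined(__clang__)
#define HSUM_TARGET_AVX __attribute__((target("avx")))
#else
#define HSUM_TARGET_AVX
#endif

// Hypothetical helper: loads four doubles and returns p[0] + p[1] + p[2] + p[3].
HSUM_TARGET_AVX double hsum4_pd(const double* p) {
    __m256d v = _mm256_loadu_pd(p);                  // {p0, p1, p2, p3}
    __m128d lo = _mm256_castpd256_pd128(v);          // {p0, p1}
    __m128d hi = _mm256_extractf128_pd(v, 1);        // {p2, p3}
    __m128d pair = _mm_add_pd(lo, hi);               // {p0+p2, p1+p3}
    __m128d upper = _mm_unpackhi_pd(pair, pair);     // {p1+p3, p1+p3}
    return _mm_cvtsd_f64(_mm_add_sd(pair, upper));   // (p0+p2) + (p1+p3)
}
```

With the values from the excerpt, {2.0, 8.0, 18.0, 32.0}, the helper returns 60.0. In code that already holds the accumulator in a register, the same four steps after the load apply directly to that register.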
3
TL;DR:
TensorFlow 1.15 crashes on my virtual machine when imported by Python (error message is Illegal instruction (core dumped)), most likely because AVX and AVX2 are disabled on it.
My hos...
Abiogenetic asked 18/1, 2021 at 18:56
7
Solved
Is the following code valid to check if a CPU supports the SSE3 instruction set?
Using the IsProcessorFeaturePresent() function apparently does not work on Windows XP.
bool CheckSSE3()
{
int CPUIn...
Yuk asked 25/5, 2011 at 8:49
5
Solved
I'm on SUSE Linux Enterprise 10/11 machines. I launch my regressions to a farm of machines running Intel processors. Some of my tests fail because my tools are built using a library which requires ...
1
Solved
I am currently learning how to work with SIMD intrinsics. I know that an AVX 256-bit vector can contain four doubles, eight floats, or eight 32-bit integers. How do we use AVX to process arrays tha...
Meingoldas asked 16/9, 2022 at 3:18