simd Questions
1
Solved
I have populated a zmm register with an array of byte integers from 0-63. The numbers serve as indices into a matrix. Non-zero elements represent rows in the matrix that contain data. Not all rows ...
8
Solved
In my program, I have a statement like the following, inside a loop.
y = (x >= 1)? 0:1;
However, I want to avoid using any relational operator, because I want to use SIMD instructions, and am...
Inez asked 29/4, 2017 at 9:0
1
Solved
I compared the performance of 3x3 and 4x4 matrix multiplication using Eigen with the -O3 optimization flag, and surprisingly, I found that the 4x4 case is more than twice as fast as the 3x3 case! T...
Tamberg asked 26/8 at 16:23
3
Solved
Is there a way of pushing a packed doubleword integer from an XMM register onto the stack, and then later popping it back when needed?
Ideally I am looking for something like PUSH or POP for general pur...
2
Solved
How can I divide 16 8-bit integers by 4 (or shift them 2 to the right) using SSE intrinsics?
Betimes asked 9/1, 2017 at 19:32
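One common approach to the question above can be sketched as follows. SSE2 has no 8-bit shift, so the sketch shifts 16-bit lanes and then masks off the two bits that leak in from the neighbouring byte. It assumes the 16 values are unsigned; the helper names `div4_epu8` and `div4_bytes` are hypothetical, chosen only for this illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Divide 16 unsigned 8-bit lanes by 4 (logical shift right by 2).
 * Assumption: the bytes are unsigned. */
static inline __m128i div4_epu8(__m128i v)
{
    /* Shift each 16-bit lane right by 2. The low two bits of every
     * high byte spill into the top two bits of the byte below it. */
    __m128i shifted = _mm_srli_epi16(v, 2);
    /* Clear the top two bits of every byte to discard the spill. */
    return _mm_and_si128(shifted, _mm_set1_epi8(0x3F));
}

/* Convenience wrapper working on plain (possibly unaligned) arrays. */
static void div4_bytes(const uint8_t in[16], uint8_t out[16])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, div4_epu8(v));
}
```

For signed bytes the mask alone would not be enough, because the sign bits would have to be restored after the shift.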
4
Solved
1
My program adds float arrays and is unrolled 4x when compiled with max optimizations by MSVC and G++. I didn't understand why both compilers chose to unroll 4x so I did some testing and found only ...
Melonie asked 19/6, 2022 at 5:11
20
I'm looking to optimize this linear search:
static int
linear (const int *arr, int n, int key)
{
    int i = 0;
    while (i < n) {
        if (arr [i] >= key)
            break;
        ++i;
    }
    return i;
}
The array i...
Rubirubia asked 30/4, 2010 at 1:50
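A minimal sketch of one direction for the linear search above: if the array is sorted in ascending order (an assumption here, since the excerpt is cut off), the index of the first element >= key equals the number of elements smaller than key. Counting those elements removes the data-dependent early exit, which gives a compiler a loop it can auto-vectorize. The name `linear_count` is hypothetical.

```c
/* Branch-free variant of the linear search.
 * Assumption: arr is sorted in ascending order, so the number of
 * elements < key is the index of the first element >= key. */
static int linear_count(const int *arr, int n, int key)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        count += (arr[i] < key);  /* comparison yields 0 or 1 */
    return count;
}
```

The trade-off is that this version always reads all n elements, so it is only likely to help for small arrays.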
3
Solved
I want to write applications in JavaScript that require a large amount of numerical computation. However, I'm very confused about the state of efficient linear-algebra-like computation in client-si...
Minny asked 21/3, 2017 at 2:51
4
I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column.
For instance for n=4:
1101
0101
0001
1001
M stored as { { 1,1,0,1}, {0...
Laaspere asked 23/7, 2018 at 9:37
2
Solved
I need to bit scan reverse with LZCNT an array of words: 16 bits.
The throughput of LZCNT is 1 execution per clock on the latest generation of Intel processors. The throughput on an AMD Ryzen seems to...
Milan asked 15/5, 2019 at 15:43
1
Solved
The following code produces assembly that conditionally executes SIMD in GCC 12.3 when compiled with -O3. For completeness, the code always executes SIMD in GCC 13.2 and never executes SIMD in clan...
Calyptra asked 16/2 at 22:17
3
3
Solved
I will preface this by saying that I am a complete beginner at SIMD intrinsics.
Essentially, I have a CPU which supports the AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would lik...
Cabasset asked 27/12, 2019 at 0:23
4
Solved
uint8_t data[] = "mykeyxyz:1234\nky:123\n...";
Each line of my string has the format key:value, where len(key) <= 16 is guaranteed for every line. I want to load mykeyxyz into a __m128i, but fill...
Laird asked 11/1 at 14:59
3
Pretty much what the title says: I need a way to shift/shuffle the positions of all elements in a 256-bit AVX register by N places. All I have found about this uses 32 or 64 bit values (__buil...
1
Solved
I've run the same binaries compiled with gcc-13 (https://godbolt.org/z/qq5WrE8qx) on Intel i3-N305 3.8GHz and AMD Ryzen 7 3800X 3.9GHz PCs. This code uses the VCL library (https://github.com/vectorclas...
Underwing asked 25/12, 2023 at 7:56
1
There are similar older questions, but they use intrinsics and old instruction sets. I have a function f written with the C++ vector class library (https://www.agner.org/optimize/#vectorclass):
i...
Cocainism asked 23/12, 2023 at 9:46
1
Solved
I have this function in C++
void routine2(float alpha, float beta) {
    unsigned int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            w[i] = w[i] - beta + alpha * A[i][j] * x[j];
}
...
Hebetate asked 19/12, 2023 at 21:4
4
Solved
I have a row-wise array of floats (~20 cols x ~1M rows) from which I need to extract two columns at a time into two __m256 registers.
...a0.........b0......
...a1.........b1......
// ...
...a7.......
4
I am trying to understand whether there is a fast way to do a matrix transpose (64x64 bits) using ARM SIMD instructions.
I tried to explore the VTRN instruction of ARM SIMD but am not sure of its effec...
1
Solved
I am learning about C AVX intrinsics and I am wondering how this works.
I know that I can do something like this:
__m256 evens = _mm256_set_ps(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0);
H...
1
Solved
My CPU is AMD Ryzen 7 7840H which supports AVX-512 instruction set. When I run the .NET8 program, the value of Vector512.IsHardwareAccelerated is true. But System.Numerics.Vector<T> is still ...
Complacence asked 19/11, 2023 at 4:40
4
Solved
I work with AVX2 and need to compute a 64-bit x 64-bit -> 128-bit widening multiplication and get the 64-bit high part as fast as possible. Since AVX2 has no such instruction, is it reasonable for m...
Plantain asked 26/6, 2015 at 9:13
4
Solved
SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers.
_mm_cvtps_epi32()
_mm_cvtepi32_ps()
But there are no equivalents for double-precision and 64-bi...
Apotheosize asked 14/12, 2016 at 14:9
© 2022 - 2024 — McMap. All rights reserved.