SIMD Questions

1

Solved

I have populated a zmm register with an array of byte integers from 0-63. The numbers serve as indices into a matrix. Non-zero elements represent rows in the matrix that contain data. Not all rows ...
Oxazine asked 10/5, 2020 at 19:28

8

Solved

In my program, I have a statement like the following, inside a loop. y = (x >= 1)? 0:1; However, I want to avoid using any relational operator, because I want to use SIMD instructions, and am...
Inez asked 29/4, 2017 at 9:0
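One possible way to express the statement above without a relational operator is sketched below. It assumes `x` is a 32-bit signed integer (the excerpt does not state its type), and the helper name `not_positive` is made up for illustration. The idea is that `x <= 0` holds exactly when the sign bit of `x` or of `x - 1` is set, and OR, subtract, and shift all have direct SIMD equivalents.

```c
#include <stdint.h>

/* Sketch: y = (x >= 1) ? 0 : 1 without relational operators.
 * Assumes a 32-bit signed x; the arithmetic is done on the unsigned
 * bit pattern so that x - 1 cannot overflow for INT32_MIN. */
static inline uint32_t not_positive(int32_t x) {
    uint32_t ux = (uint32_t)x;
    /* Sign bit of x is set when x < 0; sign bit of x - 1 is set when x == 0. */
    return (ux | (ux - 1u)) >> 31;
}
```

In a vectorized loop the same three operations could be applied per lane; whether that beats a packed compare depends on the target instruction set.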

1

Solved

I compared the performance of 3x3 and 4x4 matrix multiplication using Eigen with the -O3 optimization flag, and surprisingly, I found that the 4x4 case is more than twice as fast as the 3x3 case! T...
Tamberg asked 26/8 at 16:23

3

Solved

Is there a way of pushing a packed doubleword integer from an XMM register to the stack, and then later popping it back when needed? Ideally I am looking for something like PUSH or POP for general pur...
Footcandle asked 15/4, 2012 at 12:13

2

Solved

How can I divide 16 8-bit integers by 4 (or shift them 2 to the right) using SSE intrinsics?
Betimes asked 9/1, 2017 at 19:32
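A minimal sketch for the question above, assuming the 16 bytes are unsigned and that SSE2 is available (it is baseline on x86-64). SSE2 has no 8-bit shift, so a common workaround is to shift 16-bit lanes and then mask off the two bits that leak in from the neighbouring byte. The function name `div4_u8x16` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sketch: divide 16 unsigned 8-bit integers by 4 (logical shift right by 2). */
static void div4_u8x16(const uint8_t *src, uint8_t *dst) {
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    /* Shift each 16-bit lane right by 2; the low byte of each lane
     * receives two stray bits from the high byte. */
    __m128i shifted = _mm_srli_epi16(v, 2);
    /* Keep only the low 6 bits of every byte to discard the stray bits. */
    __m128i result = _mm_and_si128(shifted, _mm_set1_epi8(0x3F));
    _mm_storeu_si128((__m128i *)dst, result);
}
```

For signed bytes this sketch would not be correct as written, since an arithmetic shift would be needed instead.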

4

Solved

Is there a way to XOR horizontally an AVX register—specifically, to XOR the four 64-bit components of a 256-bit register? The goal is to get the XOR of all 4 64-bit components of an AVX register. ...
Oviposit asked 5/7, 2017 at 21:0

1

My program adds float arrays and is unrolled 4x when compiled with max optimizations by MSVC and G++. I didn't understand why both compilers chose to unroll 4x so I did some testing and found only ...
Melonie asked 19/6, 2022 at 5:11

20

I'm looking to optimize this linear search: static int linear (const int *arr, int n, int key) { int i = 0; while (i < n) { if (arr [i] >= key) break; ++i; } return i; } The array i...
Rubirubia asked 30/4, 2010 at 1:50

3

Solved

I want to write applications in JavaScript that require a large amount of numerical computation. However, I'm very confused about the state of efficient linear-algebra-like computation in client-si...
Minny asked 21/3, 2017 at 2:51

4

I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column. For instance for n=4: 1101 0101 0001 1001 M stored as { { 1,1,0,1}, {0...
Laaspere asked 23/7, 2018 at 9:37

2

Solved

I need to bit-scan-reverse an array of 16-bit words with LZCNT. The throughput of LZCNT is one execution per clock on the latest generation of Intel processors. The throughput on an AMD Ryzen seems to...
Milan asked 15/5, 2019 at 15:43

1

Solved

The following code produces assembly that conditionally executes SIMD in GCC 12.3 when compiled with -O3. For completeness, the code always executes SIMD in GCC 13.2 and never executes SIMD in clan...
Calyptra asked 16/2 at 22:17

3

I want to clamp 32-bit unsigned ints to fixed value (0x10000) using only SSE2 instructions. Basically, this C code: if (c>0x10000) c=0x10000; This code below works, but I'm wondering if it can b...
Franzen asked 2/2 at 17:46
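The asker's own code is cut off in the excerpt above, so the following is only one possible SSE2-only approach, not a reconstruction of it. SSE2 lacks an unsigned 32-bit compare or minimum, so this sketch flips the sign bit of both operands to turn the unsigned comparison into a signed one and then selects with AND/ANDNOT. The function name `clamp4_u32` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sketch: clamp four unsigned 32-bit values to at most 0x10000 using SSE2 only. */
static void clamp4_u32(const uint32_t *src, uint32_t *dst) {
    const __m128i limit = _mm_set1_epi32(0x10000);
    const __m128i bias  = _mm_set1_epi32(INT32_MIN);  /* 0x80000000 in every lane */
    __m128i c = _mm_loadu_si128((const __m128i *)src);
    /* Flipping the sign bit maps unsigned order onto signed order,
     * so the signed compare below acts as an unsigned c > limit. */
    __m128i gt = _mm_cmpgt_epi32(_mm_xor_si128(c, bias),
                                 _mm_xor_si128(limit, bias));
    /* Take limit where c > limit, otherwise keep c. */
    __m128i result = _mm_or_si128(_mm_and_si128(gt, limit),
                                  _mm_andnot_si128(gt, c));
    _mm_storeu_si128((__m128i *)dst, result);
}
```

With SSE4.1 available, a single unsigned minimum would replace the compare-and-select, but that falls outside the SSE2-only constraint stated in the question.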

3

Solved

I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would lik...
Cabasset asked 27/12, 2019 at 0:23

4

Solved

uint8_t data[] = "mykeyxyz:1234\nky:123\n..."; Each line of the string has the format key:value, where len(key) <= 16 is guaranteed for every line. I want to load mykeyxyz into a __m128i, but fill...
Laird asked 11/1 at 14:59

3

Pretty much what the title says: I need a way to shift/shuffle the positions of all elements in a 256-bit AVX register by N places. All I have found about this uses 32 or 64 bit values (__buil...
Wolff asked 12/2, 2021 at 22:17

1

Solved

I've run the same binaries compiled with gcc-13 (https://godbolt.org/z/qq5WrE8qx) on Intel i3-N305 3.8GHz and AMD Ryzen 7 3800X 3.9GHz PCs. This code uses VCL library (https://github.com/vectorclas...
Underwing asked 25/12, 2023 at 7:56

1

There are similar older questions, but they are using intrinsics and old instruction sets. I have a function f written with C++ vector class library (https://www.agner.org/optimize/#vectorclass): i...
Cocainism asked 23/12, 2023 at 9:46

1

Solved

I have this function in C++ void routine2(float alpha, float beta) { unsigned int i, j; for (i = 0; i < N; i++) for (j = 0; j < N; j++) w[i] = w[i] - beta + alpha * A[i][j] * x[j]; } ...
Hebetate asked 19/12, 2023 at 21:4

4

Solved

I have a row-wise array of floats (~20 cols x ~1M rows) from which I need to extract two columns at a time into two __m256 registers. ...a0.........b0...... ...a1.........b1...... // ... ...a7.......
Marolda asked 27/2, 2017 at 23:58

4

I am trying to understand whether there is a fast way to do a matrix transpose (64x64 bits) using ARM SIMD instructions. I tried to explore the VTRN instruction of ARM SIMD but am not sure of its effec...
Soggy asked 21/3, 2022 at 4:19

1

Solved

I am learning about C AVX intrinsics and I am wondering how this works. I know that I can do something like this: __m256 evens = _mm256_set_ps(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0); H...
Paige asked 29/11, 2023 at 20:50

1

Solved

My CPU is an AMD Ryzen 7 7840H, which supports the AVX-512 instruction set. When I run the .NET 8 program, the value of Vector512.IsHardwareAccelerated is true. But System.Numerics.Vector<T> is still ...
Complacence asked 19/11, 2023 at 4:40

4

Solved

I work with AVX2 and need to calculate a 64-bit x 64-bit -> 128-bit widening multiplication and get the 64-bit high part in the fastest manner. Since AVX2 has no such instruction, is it reasonable for m...
Plantain asked 26/6, 2015 at 9:13

4

Solved

SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers. _mm_cvtps_epi32() _mm_cvtepi32_ps() But there are no equivalents for double-precision and 64-bi...
Apotheosize asked 14/12, 2016 at 14:9
