simd - McMap

1

Solved

Emulate AVX512 VPCOMPRESSB byte packing without AVX512_VBMI2

I have populated a zmm register with an array of byte integers from 0-63. The numbers serve as indices into a matrix. Non-zero elements represent rows in the matrix that contain data. Not all rows ...

x86-64 simd avx avx512

Oxazine asked 10/5, 2020 at 19:28

8

Solved

Is there a way to convert an integer to 1 if it is >= 1 without using any relational operator?

In my program, I have a statement like the following, inside a loop. y = (x >= 1)? 0:1; However, I want to avoid using any relational operator, because I want to use SIMD instructions, and am...

c math boolean logical-operators simd

Inez asked 29/4, 2017 at 9:0

1

Solved

Why is 4x4 Matrix Multiplication in Eigen More Than Twice as Fast as 3x3?

I compared the performance of 3x3 and 4x4 matrix multiplication using Eigen with the -O3 optimization flag, and surprisingly, I found that the 4x4 case is more than twice as fast as the 3x3 case! T...

c++ assembly eigen matrix-multiplication simd

Tamberg asked 26/8, 2024 at 16:23

3

Solved

Push XMM register to the stack

Is there a way of pushing a packed doubleword integer from an XMM register to the stack, and then popping it back later when needed? Ideally I am looking for something like PUSH or POP for general pur...

assembly x86 simd sse

Footcandle asked 15/4, 2012 at 12:13

2

Solved

Divide 8-bit integers by 4 (or shift) using SSE

How can I divide 16 8-bit integers by 4 (or shift them 2 to the right) using SSE intrinsics?

c++ x86 sse simd intrinsics

Betimes asked 9/1, 2017 at 19:32

4

Solved

Horizontal XOR in AVX

Is there a way to XOR horizontally an AVX register—specifically, to XOR the four 64-bit components of a 256-bit register? The goal is to get the XOR of all 4 64-bit components of an AVX register. ...

c++ assembly x86 simd avx

Oviposit asked 5/7, 2017 at 21:0

1

Is there a special benefit to consuming whole cache lines between iterations of a loop?

My program adds float arrays and is unrolled 4x when compiled with max optimizations by MSVC and G++. I didn't understand why both compilers chose to unroll 4x so I did some testing and found only ...

c++ visual-c++ cpu-architecture simd cpu-cache

Melonie asked 19/6, 2022 at 5:11

20

How fast can you make linear search?

I'm looking to optimize this linear search: static int linear (const int *arr, int n, int key) { int i = 0; while (i < n) { if (arr [i] >= key) break; ++i; } return i; } The array i...

c search optimization simd linear-search

Rubirubia asked 30/4, 2010 at 1:50

3

Solved

Does browser JavaScript allow for SIMD or Vectorized operations?

I want to write applications in JavaScript that require a large amount of numerical computation. However, I'm very confused about the state of efficient linear-algebra-like computation in client-si...

javascript matrix vector vectorization simd

Minny asked 21/3, 2017 at 2:51

4

Matrix transpose and population count

I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column. For instance for n=4: 1101 0101 0001 1001 M stored as { { 1,1,0,1}, {0...

bit-manipulation transpose simd avx bitcount

Laaspere asked 23/7, 2018 at 9:37

2

Solved

Can AVX2 implement faster processing of LZCNT on a word array?

I need to bit-scan-reverse an array of 16-bit words with LZCNT. The throughput of LZCNT is 1 execution per clock on the latest generation of Intel processors. The throughput on an AMD Ryzen seems to...

x86 simd avx micro-optimization avx2

Milan asked 15/5, 2019 at 15:43

1

Solved

Why does GCC generate code that conditionally executes a SIMD implementation?

The following code produces assembly that conditionally executes SIMD in GCC 12.3 when compiled with -O3. For completeness, the code always executes SIMD in GCC 13.2 and never executes SIMD in clan...

c++ gcc simd auto-vectorization

Calyptra asked 16/2, 2024 at 22:17

3

Clamp unsigned int to 0x10000 using SSE2

I want to clamp 32-bit unsigned ints to fixed value (0x10000) using only SSE2 instructions. Basically, this C code: if (c>0x10000) c=0x10000; This code below works, but I'm wondering if it can b...

assembly x86 simd sse2 clamp

Franzen asked 2/2, 2024 at 17:46

3

Solved

AVX2: Computing dot product of 512 float arrays

I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would lik...

c++ simd avx2 dot-product fma

Cabasset asked 27/12, 2019 at 0:23

4

Solved

Fastest way to mask out bytes higher than separator position with SIMD

uint8_t data[] = "mykeyxyz:1234\nky:123\n..."; My lines have the format key:value, where each line is guaranteed to have len(key) <= 16. I want to load mykeyxyz into a __m128i, but fill...

c++ assembly optimization simd avx

Laird asked 11/1, 2024 at 14:59

3

Is there a way to shuffle an 8-bit x 32 ymm register right/left by N positions (C++)?

Pretty much what the title says: I need a way to shift/shuffle the positions of all elements in a 256-bit AVX register by N places. All I have found about this uses 32 or 64 bit values (__buil...

c++ simd clang++ avx2

Wolff asked 12/2, 2021 at 22:17

1

Solved

Why is this index-of-max function over many arrays of 256 bytes so slow on Intel i3-N305 compared to AMD Ryzen 7 3800X?

I've run the same binaries compiled with gcc-13 (https://godbolt.org/z/qq5WrE8qx) on Intel i3-N305 3.8GHz and AMD Ryzen 7 3800X 3.9GHz PCs. This code uses VCL library (https://github.com/vectorclas...

c++ benchmarking simd avx2 vector-class-library

Underwing asked 25/12, 2023 at 7:56

1

Looking for an efficient function to find an index of max element in SIMD vector using a library

There are similar older questions, but they are using intrinsics and old instruction sets. I have a function f written with C++ vector class library (https://www.agner.org/optimize/#vectorclass): i...

c++ optimization simd avx vector-class-library

Cocainism asked 23/12, 2023 at 9:46

1

Solved

How to vectorize a vector-matrix product with SSE?

I have this function in C++ void routine2(float alpha, float beta) { unsigned int i, j; for (i = 0; i < N; i++) for (j = 0; j < N; j++) w[i] = w[i] - beta + alpha * A[i][j] * x[j]; } ...

c++ matrix-multiplication simd sse dot-product

Hebetate asked 19/12, 2023 at 21:4

4

Solved

Packing and de-interleaving two __m256 registers

I have a row-wise array of floats (~20 cols x ~1M rows) from which I need to extract two columns at a time into two __m256 registers. ...a0.........b0...... ...a1.........b1...... // ... ...a7.......

c++ x86 simd avx avx2

Marolda asked 27/2, 2017 at 23:58

4

fast bit-matrix (64x64) transpose algorithm using SIMD (ARM)

I am trying to understand, if there is a fast way to do a matrix transpose (64x64 bits) using ARM SIMD instructions. I tried to explore the VTRN instruction of ARM SIMD but am not sure of its effec...

assembly arm transpose simd neon

Soggy asked 21/3, 2022 at 4:19

1

Solved

How does SIMD (AVX) processing work? For example, if I want 10 32-bit floats, how do I fit them in a 256-bit AVX vector?

I am learning about C avx intrinsics and I am wondering how this works. I am familiar that I can do something like this: __m256 evens = _mm256_set_ps(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0); H...

c simd avx

Paige asked 29/11, 2023 at 20:50

1

Solved

.NET8 supports Vector512, but why doesn't Vector reach 512 bits?

My CPU is AMD Ryzen 7 7840H which supports AVX-512 instruction set. When I run the .NET8 program, the value of Vector512.IsHardwareAccelerated is true. But System.Numerics.Vector<T> is still ...

c# simd intrinsics avx512 .net-8.0

Complacence asked 19/11, 2023 at 4:40

4

Solved

Is it really efficient to use Karatsuba algorithm in 64-bit x 64-bit multiplication?

I work on AVX2 and need to calculate a 64-bit x 64-bit -> 128-bit widening multiplication and get the 64-bit high part in the fastest manner. Since AVX2 has no such instruction, is it reasonable for m...

c++ performance parallel-processing simd avx2

Plantain asked 26/6, 2015 at 9:13

4

Solved

How to efficiently perform double/int64 conversions with SSE/AVX?

SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers. _mm_cvtps_epi32() _mm_cvtepi32_ps() But there are no equivalents for double-precision and 64-bi...

c++ floating-point sse simd avx

Apotheosize asked 14/12, 2016 at 14:9