simd Questions

2

Solved

Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, it does this: i...
Womenfolk asked 28/7, 2017 at 14:36
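A scalar model of the two unpack operations the question describes may help; this sketch mirrors the documented behavior of `_mm_unpacklo_epi32` / `_mm_unpackhi_epi32` on 4-element vectors (names here are illustrative, not from the question):

```cpp
#include <array>

// Scalar model of SSE's 32-bit unpack: given two 4-element vectors a and b,
// unpacklo interleaves the low halves (a0,b0,a1,b1) and unpackhi the high
// halves (a2,b2,a3,b3), matching _mm_unpacklo_epi32 / _mm_unpackhi_epi32.
std::array<int, 4> unpacklo4(const std::array<int, 4>& a,
                             const std::array<int, 4>& b) {
    return {a[0], b[0], a[1], b[1]};
}

std::array<int, 4> unpackhi4(const std::array<int, 4>& a,
                             const std::array<int, 4>& b) {
    return {a[2], b[2], a[3], b[3]};
}
```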

1

I have the following loop to calculate basic summary statistics (mean, standard deviation, minimum and maximum) in C++, skipping missing values (x is a double vector): int k = 0; long double sum = ...
Intoxicated asked 22/9, 2023 at 19:54
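A minimal scalar reference for such a loop, assuming "missing" means NaN (the struct and function names are hypothetical):

```cpp
#include <cmath>
#include <limits>
#include <vector>

struct Stats { long double sum; double min; double max; long long count; };

// Accumulate sum/min/max over x, skipping NaN ("missing") entries, as a
// scalar baseline for the summary-statistics loop in the question.
Stats summarize(const std::vector<double>& x) {
    Stats s{0.0L,
            std::numeric_limits<double>::infinity(),
            -std::numeric_limits<double>::infinity(),
            0};
    for (double v : x) {
        if (std::isnan(v)) continue;   // skip missing values
        s.sum += v;
        if (v < s.min) s.min = v;
        if (v > s.max) s.max = v;
        ++s.count;
    }
    return s;
}
```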

4

Solved

Profiling suggests that this function here is a real bottleneck for my application: static inline int countEqualChars(const char* string1, const char* string2, int size) { int r = 0; for (int j...
Halfdan asked 24/3, 2013 at 13:23
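The scalar shape of the function being profiled, reconstructed from the teaser as a sketch; SSE answers to this kind of question typically compare 16 bytes at a time with `_mm_cmpeq_epi8` and popcount the resulting mask:

```cpp
// Counts positions where the two buffers hold the same byte - the scalar
// baseline that vectorized versions (e.g. _mm_cmpeq_epi8 + popcount) beat.
static inline int countEqualChars(const char* string1, const char* string2,
                                  int size) {
    int r = 0;
    for (int j = 0; j < size; ++j)
        if (string1[j] == string2[j]) ++r;
    return r;
}
```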

1

Solved

I have a loop that loads two float* arrays into __m256 vectors and processes them. Following this loop, I have code that loads the balance of values into the vectors and then processes them. So the...
Gass asked 2/6, 2023 at 12:48
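The main-loop/remainder split described here can be sketched in scalar form (an 8-wide inner loop stands in for the `__m256` body; the function name is illustrative):

```cpp
#include <vector>

// Process floor(n/8)*8 elements in 8-wide blocks (standing in for __m256
// loads), then finish the leftover 0..7 elements with a scalar tail loop.
float sum_blocked(const std::vector<float>& a) {
    const std::size_t n = a.size();
    const std::size_t vec_end = n - (n % 8);  // end of the "vector" region
    float acc = 0.0f;
    for (std::size_t i = 0; i < vec_end; i += 8)   // vector body stand-in
        for (std::size_t j = 0; j < 8; ++j)
            acc += a[i + j];
    for (std::size_t i = vec_end; i < n; ++i)      // scalar tail
        acc += a[i];
    return acc;
}
```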

2

Solved

There are AVX-512 VNNI instructions, starting with Cascade Lake Intel CPUs, which can accelerate inference of quantized neural networks on CPU. In particular there is an instruction _mm512_dpbusd_epi32...
Irmairme asked 16/6, 2021 at 9:4
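Per Intel's documentation, `_mm512_dpbusd_epi32` multiplies groups of four unsigned 8-bit values with four signed 8-bit values and accumulates into 32-bit lanes; a scalar model of one lane (function name is illustrative):

```cpp
#include <cstdint>

// Scalar model of one 32-bit lane of _mm512_dpbusd_epi32: four u8 x s8
// products are summed and added to the 32-bit accumulator src.
int32_t dpbusd_lane(int32_t src, const uint8_t a[4], const int8_t b[4]) {
    int32_t acc = src;
    for (int i = 0; i < 4; ++i)
        acc += int32_t(a[i]) * int32_t(b[i]);
    return acc;
}
```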

0

I tried to vectorize the premultiplication of 64-bit colors of 16-bit integer ARGB channels. I quickly realized that due to lack of accelerated integer division support I need to convert my values ...
Cai asked 14/3, 2023 at 11:37
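A scalar reference for premultiplying one 16-bit channel, showing why the question runs into division: the exact result needs a 32-bit product divided by 65535, and SIMD has no integer-divide instruction, hence the float-conversion detour (the function name is illustrative):

```cpp
#include <cstdint>

// Premultiply one 16-bit channel c by 16-bit alpha a: exact integer form
// is (c * a) / 65535, computed in 32 bits to avoid overflow.
uint16_t premultiply_channel(uint16_t c, uint16_t a) {
    return uint16_t((uint32_t(c) * a) / 65535u);
}
```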

2

Solved

I would like to implement a parallel matrix-vector multiplication for a fixed size matrix (~3500x3500 floats) optimized for my CPUs and cache layout (AMD Zen 2/4) that is repeatedly executed for ch...
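The scalar baseline for the product y = A*x with a row-major matrix, as a sketch; a version tuned for Zen 2/4 would tile rows for the cache hierarchy and use AVX2 FMAs, but the access pattern is the same (names here are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Scalar matrix-vector product y = A*x for a rows x cols row-major matrix;
// each output element is an independent dot product of one row with x.
std::vector<float> matvec(const std::vector<float>& A,
                          const std::vector<float>& x,
                          std::size_t rows, std::size_t cols) {
    std::vector<float> y(rows, 0.0f);
    for (std::size_t i = 0; i < rows; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < cols; ++j)
            acc += A[i * cols + j] * x[j];
        y[i] = acc;
    }
    return y;
}
```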

2

Solved

Why is np.dot so much faster than np.sum? Following this answer we know that np.sum is slow and has faster alternatives. For example: In [20]: A = np.random.rand(1000) In [21]: B = np.random.rand(...
Lottielotto asked 24/2, 2023 at 11:48

1

Introduction of the problem I am trying to speed up the intersection code of a (2d) ray tracer that I am writing. I am using C# and the System.Numerics library to bring the speed of SIMD instructio...
Hogan asked 9/7, 2019 at 11:42

4

Solved

Related: bitpack ascii string into 7-bit binary blob using ARM-v8 Neon SIMD - same question specialized for AArch64 intrinsics. This question covers portable C and x86-64 intrinsics. I would like ...
Fronton asked 17/12, 2022 at 4:41

3

Following my x86 question, I would like to know how it is possible to efficiently vectorize the following code on Arm-v8: static inline uint64_t Compress8x7bit(uint64_t x) { x = ((x & 0x7F00...
Shake asked 19/12, 2022 at 5:14
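The scalar SWAR technique behind both bit-packing questions can be sketched as follows; this is one standard formulation of the trick (merging pairs, then quads, then halves), not necessarily the asker's exact constants:

```cpp
#include <cstdint>

// Pack eight 7-bit ASCII bytes of x into the low 56 bits: each step folds
// the significant bits of the upper half of a lane down next to the lower
// half, doubling the packed run width (7 -> 14 -> 28 -> 56 bits).
uint64_t compress8x7bit(uint64_t x) {
    x = ((x & 0x7F007F007F007F00ULL) >> 1) | (x & 0x007F007F007F007FULL);
    x = ((x & 0x3FFF00003FFF0000ULL) >> 2) | (x & 0x00003FFF00003FFFULL);
    x = ((x & 0x0FFFFFFF00000000ULL) >> 4) | (x & 0x000000000FFFFFFFULL);
    return x;
}
```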

1

Solved

Example: https://www.godbolt.org/z/ahfcaj7W8 From https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Optimize-Options.html It says -ftree-loop-vectorize: Perform loop vectorization on trees. This f...
Propitious asked 23/12, 2022 at 10:30

1

Solved

The motivation for this question The unaligned load is generally more common to use. The developer should use the aligned SIMD load when the address is already aligned. So I started to wonder if th...
Arianearianie asked 13/12, 2022 at 13:5
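A small helper relevant to the aligned-vs-unaligned question: an aligned load such as `_mm256_load_ps` is only legal when the address is 32-byte aligned, and this is the usual runtime predicate for checking that (the function name is illustrative):

```cpp
#include <cstdint>

// True when p is 32-byte aligned, i.e. safe for an aligned 256-bit load.
bool is_aligned32(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) % 32) == 0;
}
```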

2

Solved

The GCC compiler provides a set of builtins to test some processor features, like the availability of certain instruction sets. But, according to this thread, certain CPU features may be no...
Cyte asked 8/2, 2018 at 4:31

1

Solved

I have a question relating to the pow() function in Java 17's new Vector API feature. I'm trying to implement the Black-Scholes formula in a vectorized manner, but I'm having difficulty in obtainin...
Lueck asked 10/10, 2022 at 7:4

3

I'm trying to write a vectorized implementation of BSF as an exercise, but I'm stuck, it doesn't work. The algorithm: short bitScanForward(int16_t bb) { constexpr uint16_t two = static_cast<u...
Alienist asked 3/10, 2022 at 3:31
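A scalar reference for the 16-bit bit-scan-forward being vectorized in the question (this is a plain loop version to pin down the expected semantics, not the asker's binary-search algorithm):

```cpp
#include <cstdint>

// Index of the lowest set bit of bb, or -1 when bb is zero - the scalar
// contract a vectorized BSF should reproduce per element.
short bitScanForward(int16_t bb) {
    uint16_t v = static_cast<uint16_t>(bb);
    if (v == 0) return -1;
    short i = 0;
    while ((v & 1u) == 0) { v >>= 1; ++i; }
    return i;
}
```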

8

Solved

In the last couple of years, I've been doing a lot of SIMD programming and most of the time I've been relying on compiler intrinsic functions (such as the ones for SSE programming) or on prog...
Titmouse asked 13/9, 2009 at 12:50

1

Solved

I am currently learning how to work with SIMD intrinsics. I know that an AVX 256-bit vector can contain four doubles, eight floats, or eight 32-bit integers. How do we use AVX to process arrays tha...
Meingoldas asked 16/9, 2022 at 3:18

2

Solved

Doing a zip transform with a C++ SIMD header library we might have the following pseudocode. // using xsimd binary_op = [](const auto& a, const auto& b){ return ...; } float* a, b, res; ...
Pity asked 4/9, 2022 at 23:23
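The scalar shape of that zip transform, as a sketch: res[i] = op(a[i], b[i]) over three equal-length buffers. An xsimd version would perform the same operation on batches plus a scalar tail (the function name is illustrative):

```cpp
#include <cstddef>

// Elementwise combine of a and b into res with a caller-supplied binary op;
// the scalar counterpart of a SIMD batch-load / apply / store loop.
template <class Op>
void zip_transform(const float* a, const float* b, float* res,
                   std::size_t n, Op op) {
    for (std::size_t i = 0; i < n; ++i)
        res[i] = op(a[i], b[i]);
}
```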

4

Solved

I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My at...
Gage asked 20/3, 2012 at 21:48
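A scalar model of that horizontal maximum; intrinsic answers typically reach the same result in two `_mm256_max_pd` steps after a lane permute and a shuffle (the function name is illustrative):

```cpp
#include <algorithm>

// Horizontal max of a 4-element double vector, reduced pairwise in
// log2(4) = 2 max steps, mirroring the shuffle-based SIMD reduction.
double hmax4(const double v[4]) {
    double m01 = std::max(v[0], v[1]);
    double m23 = std::max(v[2], v[3]);
    return std::max(m01, m23);
}
```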

2

I explicitly use the Intel SIMD extensions intrinsic in my C/C++ code. In order to compile the code I need to specify -mavx, or -mavx512, or something similar on the command line. I'm good with all...
Stempien asked 22/2, 2022 at 22:56

4

Solved

I'm trying to convert the following code from MATLAB to C++: function data = process(data) data = medfilt2(data, [7 7], 'symmetric'); mask = fspecial('gaussian', [35 35], 12); data = imfilter(da...
Heteronym asked 23/2, 2016 at 11:25

1

Solved

Say I have a wrapper struct, serving as a phantom type. struct Wrapper { float value; }; Is it legal to load an array of this struct directly into a SIMD intrinsic type such as __m256? For exampl...
Muriate asked 27/6, 2022 at 21:14
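The usual compile-time sanity checks before treating `Wrapper*` memory as `float*` for a SIMD load can be sketched like this; they rule out padding and alignment surprises, though they do not by themselves settle the strict-aliasing question the thread is about:

```cpp
#include <type_traits>

struct Wrapper { float value; };

// Layout preconditions for reinterpreting an array of Wrapper as floats:
// identical size and alignment, and standard layout (no vptr, no padding).
static_assert(sizeof(Wrapper) == sizeof(float));
static_assert(alignof(Wrapper) == alignof(float));
static_assert(std::is_standard_layout_v<Wrapper>);
```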

1

Solved

In this question, it is confirmed that __builtin_cpu_supports("avx2") doesn't check for OS support. (Or at least, it didn't before GCC fixed the bug). From Intel docs, I know that in addi...
Brachypterous asked 6/6, 2022 at 19:59

2

Solved

CPUs intended to provide high-performance number crunching end up with some kind of vector instruction set. There are basically two kinds: SIMD. This is conceptually straightforward, e.g. instead...
Cawnpore asked 29/5, 2022 at 9:35

© 2022 - 2024 — McMap. All rights reserved.