simd Questions

2

Solved

Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, it does this: i...
Womenfolk asked 28/7, 2017 at 14:36
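A scalar model of the two unpack operations the question describes may help; this sketch mirrors the documented behavior of `_mm_unpacklo_epi32` / `_mm_unpackhi_epi32` on 4-element vectors (names here are illustrative, not from the question):

```cpp
#include <array>

// Scalar model of SSE's 32-bit unpack: given two 4-element vectors a and b,
// unpacklo interleaves the low halves (a0,b0,a1,b1) and unpackhi the high
// halves (a2,b2,a3,b3), matching _mm_unpacklo_epi32 / _mm_unpackhi_epi32.
std::array<int, 4> unpacklo4(const std::array<int, 4>& a,
                             const std::array<int, 4>& b) {
    return {a[0], b[0], a[1], b[1]};
}

std::array<int, 4> unpackhi4(const std::array<int, 4>& a,
                             const std::array<int, 4>& b) {
    return {a[2], b[2], a[3], b[3]};
}
```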

1

I have the following loop to calculate basic summary statistics (mean, standard deviation, minimum and maximum) in C++, skipping missing values (x is a double vector): int k = 0; long double sum = ...
Intoxicated asked 22/9, 2023 at 19:54
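A minimal scalar reference for such a loop, assuming "missing" means NaN (the struct and function names are hypothetical):

```cpp
#include <cmath>
#include <limits>
#include <vector>

struct Stats { long double sum; double min; double max; long long count; };

// Accumulate sum/min/max over x, skipping NaN ("missing") entries, as a
// scalar baseline for the summary-statistics loop in the question.
Stats summarize(const std::vector<double>& x) {
    Stats s{0.0L,
            std::numeric_limits<double>::infinity(),
            -std::numeric_limits<double>::infinity(),
            0};
    for (double v : x) {
        if (std::isnan(v)) continue;   // skip missing values
        s.sum += v;
        if (v < s.min) s.min = v;
        if (v > s.max) s.max = v;
        ++s.count;
    }
    return s;
}
```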

4

Solved

Profiling suggests that this function here is a real bottleneck for my application: static inline int countEqualChars(const char* string1, const char* string2, int size) { int r = 0; for (int j...
Halfdan asked 24/3, 2013 at 13:23
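The scalar shape of the function being profiled, reconstructed from the teaser as a sketch; SSE answers to this kind of question typically compare 16 bytes at a time with `_mm_cmpeq_epi8` and popcount the resulting mask:

```cpp
// Counts positions where the two buffers hold the same byte - the scalar
// baseline that vectorized versions (e.g. _mm_cmpeq_epi8 + popcount) beat.
static inline int countEqualChars(const char* string1, const char* string2,
                                  int size) {
    int r = 0;
    for (int j = 0; j < size; ++j)
        if (string1[j] == string2[j]) ++r;
    return r;
}
```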

1

Solved

I have a loop that loads two float* arrays into __m256 vectors and processes them. Following this loop, I have code that loads the balance of values into the vectors and then processes them. So the...
Gass asked 2/6, 2023 at 12:48
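The main-loop/remainder split described here can be sketched in scalar form (an 8-wide inner loop stands in for the `__m256` body; the function name is illustrative):

```cpp
#include <vector>

// Process floor(n/8)*8 elements in 8-wide blocks (standing in for __m256
// loads), then finish the leftover 0..7 elements with a scalar tail loop.
float sum_blocked(const std::vector<float>& a) {
    const std::size_t n = a.size();
    const std::size_t vec_end = n - (n % 8);  // end of the "vector" region
    float acc = 0.0f;
    for (std::size_t i = 0; i < vec_end; i += 8)   // vector body stand-in
        for (std::size_t j = 0; j < 8; ++j)
            acc += a[i + j];
    for (std::size_t i = vec_end; i < n; ++i)      // scalar tail
        acc += a[i];
    return acc;
}
```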

2

Solved

There are AVX-512 VNNI instructions, starting with Cascade Lake Intel CPUs, which can accelerate inference of quantized neural networks on CPU. In particular there is an instruction _mm512_dpbusd_epi32...
Irmairme asked 16/6, 2021 at 9:4
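Per Intel's documentation, `_mm512_dpbusd_epi32` multiplies groups of four unsigned 8-bit values with four signed 8-bit values and accumulates into 32-bit lanes; a scalar model of one lane (function name is illustrative):

```cpp
#include <cstdint>

// Scalar model of one 32-bit lane of _mm512_dpbusd_epi32: four u8 x s8
// products are summed and added to the 32-bit accumulator src.
int32_t dpbusd_lane(int32_t src, const uint8_t a[4], const int8_t b[4]) {
    int32_t acc = src;
    for (int i = 0; i < 4; ++i)
        acc += int32_t(a[i]) * int32_t(b[i]);
    return acc;
}
```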

0

I tried to vectorize the premultiplication of 64-bit colors of 16-bit integer ARGB channels. I quickly realized that due to lack of accelerated integer division support I need to convert my values ...
Cai asked 14/3, 2023 at 11:37
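A scalar reference for premultiplying one 16-bit channel, showing why the question runs into division: the exact result needs a 32-bit product divided by 65535, and SIMD has no integer-divide instruction, hence the float-conversion detour (the function name is illustrative):

```cpp
#include <cstdint>

// Premultiply one 16-bit channel c by 16-bit alpha a: exact integer form
// is (c * a) / 65535, computed in 32 bits to avoid overflow.
uint16_t premultiply_channel(uint16_t c, uint16_t a) {
    return uint16_t((uint32_t(c) * a) / 65535u);
}
```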

2

Solved

I would like to implement a parallel matrix-vector multiplication for a fixed size matrix (~3500x3500 floats) optimized for my CPUs and cache layout (AMD Zen 2/4) that is repeatedly executed for ch...
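The scalar baseline for the product y = A*x with a row-major matrix, as a sketch; a version tuned for Zen 2/4 would tile rows for the cache hierarchy and use AVX2 FMAs, but the access pattern is the same (names here are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Scalar matrix-vector product y = A*x for a rows x cols row-major matrix;
// each output element is an independent dot product of one row with x.
std::vector<float> matvec(const std::vector<float>& A,
                          const std::vector<float>& x,
                          std::size_t rows, std::size_t cols) {
    std::vector<float> y(rows, 0.0f);
    for (std::size_t i = 0; i < rows; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < cols; ++j)
            acc += A[i * cols + j] * x[j];
        y[i] = acc;
    }
    return y;
}
```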

2

Solved

Why is np.dot so much faster than np.sum? Following this answer we know that np.sum is slow and has faster alternatives. For example: In [20]: A = np.random.rand(1000) In [21]: B = np.random.rand(...
Lottielotto asked 24/2, 2023 at 11:48

1

Introduction of the problem I am trying to speed up the intersection code of a (2d) ray tracer that I am writing. I am using C# and the System.Numerics library to bring the speed of SIMD instructio...
Hogan asked 9/7, 2019 at 11:42

4

Solved

Related: bitpack ascii string into 7-bit binary blob using ARM-v8 Neon SIMD - same question specialized for AArch64 intrinsics. This question covers portable C and x86-64 intrinsics. I would like ...
Fronton asked 17/12, 2022 at 4:41

3

Following my x86 question, I would like to know how it is possible to efficiently vectorize the following code on Arm-v8: static inline uint64_t Compress8x7bit(uint64_t x) { x = ((x & 0x7F00...
Shake asked 19/12, 2022 at 5:14
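The scalar SWAR technique behind both bit-packing questions can be sketched as follows; this is one standard formulation of the trick (merging pairs, then quads, then halves), not necessarily the asker's exact constants:

```cpp
#include <cstdint>

// Pack eight 7-bit ASCII bytes of x into the low 56 bits: each step folds
// the significant bits of the upper half of a lane down next to the lower
// half, doubling the packed run width (7 -> 14 -> 28 -> 56 bits).
uint64_t compress8x7bit(uint64_t x) {
    x = ((x & 0x7F007F007F007F00ULL) >> 1) | (x & 0x007F007F007F007FULL);
    x = ((x & 0x3FFF00003FFF0000ULL) >> 2) | (x & 0x00003FFF00003FFFULL);
    x = ((x & 0x0FFFFFFF00000000ULL) >> 4) | (x & 0x000000000FFFFFFFULL);
    return x;
}
```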

1

Solved

Example: https://www.godbolt.org/z/ahfcaj7W8 From https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Optimize-Options.html It says -ftree-loop-vectorize: Perform loop vectorization on trees. This f...
Propitious asked 23/12, 2022 at 10:30

1

Solved

The motivation for this question The unaligned load is generally more common to use. The developer should use the aligned SIMD load when the address is already aligned. So I started to wonder if th...
Arianearianie asked 13/12, 2022 at 13:5
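A small helper relevant to the aligned-vs-unaligned question: an aligned load such as `_mm256_load_ps` is only legal when the address is 32-byte aligned, and this is the usual runtime predicate for checking that (the function name is illustrative):

```cpp
#include <cstdint>

// True when p is 32-byte aligned, i.e. safe for an aligned 256-bit load.
bool is_aligned32(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) % 32) == 0;
}
```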

2

Solved

The GCC compiler provides a set of builtins to test some processor features, like the availability of certain instruction sets. But, according to this thread, certain CPU features may be no...
Cyte asked 8/2, 2018 at 4:31

1

Solved

I have a question relating to the pow() function in Java 17's new Vector API feature. I'm trying to implement the Black-Scholes formula in a vectorized manner, but I'm having difficulty in obtainin...
Lueck asked 10/10, 2022 at 7:4

3

I'm trying to write a vectorized implementation of BSF as an exercise, but I'm stuck, it doesn't work. The algorithm: short bitScanForward(int16_t bb) { constexpr uint16_t two = static_cast<u...
Alienist asked 3/10, 2022 at 3:31
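A scalar reference for the 16-bit bit-scan-forward being vectorized in the question (this is a plain loop version to pin down the expected semantics, not the asker's binary-search algorithm):

```cpp
#include <cstdint>

// Index of the lowest set bit of bb, or -1 when bb is zero - the scalar
// contract a vectorized BSF should reproduce per element.
short bitScanForward(int16_t bb) {
    uint16_t v = static_cast<uint16_t>(bb);
    if (v == 0) return -1;
    short i = 0;
    while ((v & 1u) == 0) { v >>= 1; ++i; }
    return i;
}
```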

8

Solved

In the last couple of years, I've been doing a lot of SIMD programming and most of the time I've been relying on compiler intrinsic functions (such as the ones for SSE programming) or on prog...
Titmouse asked 13/9, 2009 at 12:50

1

Solved

I am currently learning how to work with SIMD intrinsics. I know that an AVX 256-bit vector can contain four doubles, eight floats, or eight 32-bit integers. How do we use AVX to process arrays tha...
Meingoldas asked 16/9, 2022 at 3:18

2

Solved

Doing a zip transform with a C++ SIMD header library we might have the following pseudocode. // using xsimd binary_op = [](const auto& a, const auto& b){ return ...; } float* a, b, res; ...
Pity asked 4/9, 2022 at 23:23
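The scalar shape of that zip transform, as a sketch: res[i] = op(a[i], b[i]) over three equal-length buffers. An xsimd version would perform the same operation on batches plus a scalar tail (the function name is illustrative):

```cpp
#include <cstddef>

// Elementwise combine of a and b into res with a caller-supplied binary op;
// the scalar counterpart of a SIMD batch-load / apply / store loop.
template <class Op>
void zip_transform(const float* a, const float* b, float* res,
                   std::size_t n, Op op) {
    for (std::size_t i = 0; i < n; ++i)
        res[i] = op(a[i], b[i]);
}
```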

4

Solved

I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My at...
Gage asked 20/3, 2012 at 21:48
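A scalar model of that horizontal maximum; intrinsic answers typically reach the same result in two `_mm256_max_pd` steps after a lane permute and a shuffle (the function name is illustrative):

```cpp
#include <algorithm>

// Horizontal max of a 4-element double vector, reduced pairwise in
// log2(4) = 2 max steps, mirroring the shuffle-based SIMD reduction.
double hmax4(const double v[4]) {
    double m01 = std::max(v[0], v[1]);
    double m23 = std::max(v[2], v[3]);
    return std::max(m01, m23);
}
```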

2

I explicitly use the Intel SIMD extensions intrinsic in my C/C++ code. In order to compile the code I need to specify -mavx, or -mavx512, or something similar on the command line. I'm good with all...
Stempien asked 22/2, 2022 at 22:56

4

Solved

I'm trying to convert the following code from MATLAB to C++: function data = process(data) data = medfilt2(data, [7 7], 'symmetric'); mask = fspecial('gaussian', [35 35], 12); data = imfilter(da...
Heteronym asked 23/2, 2016 at 11:25

1

Solved

Say I have a wrapper struct, serving as a phantom type. struct Wrapper { float value; }; Is it legal to load an array of this struct directly into a SIMD intrinsic type such as __m256? For exampl...
Muriate asked 27/6, 2022 at 21:14
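The usual compile-time sanity checks before treating `Wrapper*` memory as `float*` for a SIMD load can be sketched like this; they rule out padding and alignment surprises, though they do not by themselves settle the strict-aliasing question the thread is about:

```cpp
#include <type_traits>

struct Wrapper { float value; };

// Layout preconditions for reinterpreting an array of Wrapper as floats:
// identical size and alignment, and standard layout (no vptr, no padding).
static_assert(sizeof(Wrapper) == sizeof(float));
static_assert(alignof(Wrapper) == alignof(float));
static_assert(std::is_standard_layout_v<Wrapper>);
```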

1

Solved

In this question, it is confirmed that __builtin_cpu_supports("avx2") doesn't check for OS support. (Or at least, it didn't before GCC fixed the bug). From Intel docs, I know that in addi...
Brachypterous asked 6/6, 2022 at 19:59

2

Solved

CPUs intended to provide high-performance number crunching end up with some kind of vector instruction set. There are basically two kinds: SIMD. This is conceptually straightforward, e.g. instead...
Cawnpore asked 29/5, 2022 at 9:35

© 2022 - 2024 — McMap. All rights reserved.