simd - 3 - McMap

7

Solved

Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions?

I was reading Agner Fog's optimization manuals, and I came across this example: double data[LEN]; void compute() { const double A = 1.1, B = 2.2, C = 3.3; int i; for(i=0; i<LEN; i++) { dat...

assembly optimization x86-64 cpu-architecture simd

Mikamikado asked 19/5, 2022 at 14:39

1

Solved

Why does SIMD have single data instructions when it's called SIMD?

I've been wondering: it's called SIMD, as in single instruction, multiple data. So why does it have single-data instructions? For example, vaddss is the single data equivalent of the multiple data v...

cpu-architecture simd sse cpu-registers avx

Asmodeus asked 27/5, 2022 at 1:09

5

Solved

Does rewriting memcpy/memcmp/... with SIMD instructions make sense?

Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software? If so, why doesn't GCC generate SIMD instructions for these library functions by default? Also, are t...

performance sse simd

Scorn asked 16/3, 2011 at 5:21

6

Solved

How to use the Intel AVX in Java?

How do I use the Intel AVX vector instruction set from Java? It's a simple question but the answer seems to be hard to find.

java simd avx

Tatouay asked 27/12, 2014 at 9:17

3

Solved

How to get GCC to use more than two SIMD registers when using intrinsics?

I am writing some code and trying to speed it up using SIMD intrinsics SSE2/3. My code is of such nature that I need to load some data into an XMM register and act on it many times. When I'm lookin...

gcc assembly x86 sse simd

Shanty asked 23/9, 2008 at 22:49

9

Solved

What is "vectorization"?

Several times now, I've encountered this term in MATLAB, Fortran, and elsewhere, but I've never found an explanation of what it means and what it does. So I'm asking here, what is vectorizatio...

vectorization simd auto-vectorization

Scenarist asked 14/9, 2009 at 15:07

3

How to create a left-packed vector of indices of the 0s in one SIMD vector?

Please tell me, I can't figure it out myself: I have an __m128i SIMD vector in which each of the 16 bytes contains one of the following values: 1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1. Is it possible to somehow transf...

c++ c simd avx2

Fredericton asked 3/5, 2022 at 10:47

5

Solved

Fastest Implementation of Exponential Function Using AVX

I'm looking for an efficient (fast) approximation of the exponential function operating on AVX elements (single-precision floating point), namely __m256 _mm256_exp_ps( __m256 x ), without SVML. R...

x86 simd avx exponential avx2

Tarpaulin asked 19/2, 2018 at 10:08

1

Solved

AVX divide __m256i packed 32-bit integers by two (no AVX2)

I'm looking for the fastest way to divide an __m256i of packed 32-bit integers by two (aka shift right by one) using AVX. I don't have access to AVX2. As far as I know, my options are: Drop down t...

c++ simd sse avx sse2

Aforethought asked 30/4, 2022 at 22:46

1

Under what conditions does a C++ compiler use floating-point pipelines to do integer division with run-time-known values for higher performance?

For example, https://godbolt.org/z/W5GbYxo7o #include<cstdint> void divTest1(int * const __restrict__ val1, int * const __restrict__ val2, int * const __restrict__ val3) { for(int i=0;i<...

c++ floating-point integer simd integer-division

Schizophrenia asked 2/5, 2022 at 13:39

6

Solved

Fastest Implementation of the Natural Exponential Function Using SSE

I'm looking for an approximation of the natural exponential function operating on an SSE element, namely __m128 exp( __m128 x ). I have an implementation which is quick but seems to be very low in...

c optimization vectorization sse simd

Dense asked 30/10, 2017 at 22:48

3

Solved

Loop vectorization - counting matches of 7-byte records with masking

I have a fairly simple loop: auto indexRecord = getRowPointer(0); bool equals; // recordCount is about 6 000 000 for (int i = 0; i < recordCount; ++i) { equals = BitString::equals(SelectMask, i...

c++ gcc vectorization simd bitmap-index

Unitarianism asked 6/4, 2022 at 20:38

2

Solved

Is there an efficient way to get the first non-zero element in an SIMD register using SIMD intrinsics?

As the title reads, if a 256-bit SIMD register is: 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | How can I efficiently get the index of the first non-zero element (i.e. the index 2 of the first 1)? The most st...

x86 bit-manipulation simd intrinsics avx

Uraemia asked 14/10, 2016 at 0:01

3

Solved

What is the difference between MOVDQA and MOVNTDQA, and VMOVDQA and VMOVNTDQ for WB/WC marked region?

What is the main difference between these instructions when used on memory marked as WB (write back) and WC (write combine)? What is different between MOVDQA and MOVNTDQA, and what is different between ...

assembly x86 sse simd avx

Immoderacy asked 26/9, 2013 at 18:16

2

A64 Neon SIMD - 256-bit comparison

I would like to compare two little-endian 256-bit values with A64 Neon instructions (asm) efficiently. Equality (=) For equality, I already got a solution: bool eq256(const UInt256 *lhs, const...

arm comparison simd neon arm64

Conspicuous asked 20/4, 2015 at 8:34

4

Is there a fast way to convert a string of 8 ASCII decimal digits into a binary number?

Consider 8 digit characters like 12345678 as a string. It can be converted to a number where every byte contains a digit like this: const char* const str = "12345678"; const char* const b...

c++ parsing simd avx2 atoi

Krug asked 22/3, 2022 at 11:00

1

Solved

Handling elements that are odd number using neon intrinsics

I am new to NEON intrinsics. I have two arrays of 99 elements each, which I am trying to add element-wise using NEON intrinsics. Since 99 is not a multiple of 8, 16, or 32, 96 elements can be hand...

c raspberry-pi simd neon armv8

Bolin asked 11/3, 2022 at 11:08

1

Solved

How to compare two vectors using SIMD and get a strncmp like result?

I want to achieve something like a strncmp result, but not that complicated. I tried to read the https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strcmp-avx2.S.html source code, but I failed ...

c simd avx avx2

Brooking asked 8/2, 2022 at 14:18

0

How to tell GCC's target_clones to compile for all SIMD levels?

GCC has a function attribute target_clones which can be used to create different versions of a function that are compiled to use different instruction sets in such a way that, when the binary is ex...

c++ c gcc simd

Danyelldanyelle asked 5/2, 2022 at 18:10

2

Modern approach to making std::vector allocate aligned memory

The following question is related; however, the answers are old, and a comment from user Marc Glisse suggests there are new approaches to this problem since C++17 that might not be adequately discussed. ...

c++ c++17 stdvector simd memory-alignment

Loveinidleness asked 11/2, 2020 at 13:19

2

Solved

Solve loop data dependency with SIMD - finding transitions between -1 and +1 in an int8_t array of sgn values

I am trying to improve performance and have had some good experience with SIMD. So far I have been using OMP and would like to improve my skills further using intrinsics. In the following scenario, I failed ...

c++ performance optimization simd avx2

Charqui asked 25/1, 2022 at 18:15

5

Solved

Fastest way to do horizontal SSE vector sum (or other reduction)

Given a vector of three (or four) floats, what is the fastest way to sum them? Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add instructions in SSE3 worth it? Wh...

assembly optimization floating-point sse simd

Emma asked 9/8, 2011 at 13:16

2

How to find the first nonzero in an array efficiently?

Suppose we want to quickly find the index of the first nonzero element in an array, to the effect of fn leading_zeros(arr: &[u32]) -> Option<usize> { arr.iter().position(|&x| x !=...

rust simd

Dictatorial asked 27/12, 2021 at 18:03

4

Solved

AVX2: BitScanReverse or CountLeadingZeros on 8 bit elements in AVX register

I would like to extract the index of the highest set bit in a 256 bit AVX register with 8 bit elements. I could neither find a bsr nor a clz implementation for this. For clz with 32 bit elements, t...

c++ simd intrinsics avx avx2

Mummify asked 30/8, 2021 at 13:32

2

C# - Construct a signal Vector<T> from an integer bitmask

I have some integer value representing a bitmask, for example 154 = 0b10011010, and I want to construct a corresponding signal Vector<T> instance <0, -1, 0, -1, -1, 0, 0, -1> (note the ...

c# vector simd intrinsics bitmask

Encyclical asked 21/12, 2021 at 13:16
