simd - 7 - McMap

1

Solved

Fastest way to populate Span<int> with an integer enumeration in .NET?

I am looking for the fastest C# / .NET Core method capable of filling a Span<int> with the enumeration 0, 1, 2, 3, ... The naive for loop - see below - is already plenty fast, but there is pr...

c# .net simd

Exocarp asked 29/10, 2019 at 18:49

4

Solved

How to instruct compiler to generate unaligned loads for __m128

I've got some code that works with __m128 values. I'm using x86-64 SSE intrinsics on these values and I find that if the values are unaligned in memory I get a crash. This is due to my compiler (cl...

c++ x86-64 sse simd intrinsics

Wealth asked 24/11, 2015 at 9:04

1

Solved

Does SSE/AVX provide a means of determining if a result was rounded up?

One of the purposes of the C1 bit in the x87 FPU status word is to show whether or not an inexact result was rounded up. Does SSE/AVX provide any such indication for scalar operations? I did no...

x86 rounding sse simd avx

Perpetuity asked 23/10, 2019 at 13:51

1

Solved

AVX2 column population count algorithm over each bit-column separately

For a project I'm working on I need to count the number of set bits per column in ripped PDF image data. I'm trying to get the total set bit count for each column in the entire PDF job (all pages)...

c++ visual-c++ x86 simd avx2

Steinberg asked 21/10, 2019 at 12:18

2

Solved

"Safe" SIMD arithmetic on aligned vectors of odd size?

Let's say I have a 16-byte-aligned structure that just wraps a 3xFloat32 array: #[repr(C, align(16))] pub struct Vector(pub [f32; 3]); Now I want to divide two instances of it, like this: us...

rust floating-point sse simd floating-point-exceptions

Quadriplegia asked 8/10, 2019 at 6:39

3

Solved

Horizontal add with __m512 (AVX512)

How does one efficiently perform horizontal addition with floats in a 512-bit AVX register (i.e., add the items from a single vector together)? For 128 and 256 bit registers this can be done using _mm...

simd intrinsics avx512

Hardej asked 12/11, 2014 at 20:58

3

How to count character occurrences using SIMD

I am given an array of lowercase characters (up to 1.5Gb) and a character c, and I want to count the occurrences of c using AVX instructions. unsigned long long char_count_AVX...

c simd avx avx2

Annmaria asked 5/2, 2019 at 18:47

2

Solved

SIMD - AVX - masking with non-zero value instead of highest bit

I have AVX (no AVX2 or AVX-512). I have a vector of 32-bit values (only the 4 lowest bits are used, the rest are always zero): [ 1010, 0000, 0000, 0000, 0000, 1010, 1010, 0000] Internally, I keep vector...

c simd avx

Ricercare asked 22/8, 2019 at 13:39

1

Solved

What is the difference between shuffle and permute

In x86-64 SIMD instruction names, as well as the intrinsic functions you can use to access them from C/C++, you find both the terms shuffle (e.g., _mm_shuffle_epi32) and permute (e.g., _mm_permute_...

x86 intel simd naming avx

Laudation asked 15/8, 2019 at 3:08

5

Solved

Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probabl...

c optimization x86 x86-64 simd

Bamby asked 9/3, 2019 at 20:13

3

Solved

Using SSE in C#

I'm currently coding an application in C# which could benefit a great deal from using SSE, as a relatively small piece of code accounts for 90-95% of the execution time. The code itself is also perfect for...

c# sse simd

Ecumenicism asked 27/5, 2013 at 12:48

3

Solved

Fast interleave 2 double arrays into an array of structs with 2 float and 1 int (loop invariant) member, with SIMD double->float conversion?

I have a section of code which is a bottleneck in a C++ application running on x86 processors, where we take double values from two arrays, cast to float and store in an array of structs. The reaso...

c++ x86 simd intrinsics avx

Bindweed asked 12/7, 2019 at 20:22

2

Solved

Memory Coalescing vs. Vectorized Memory Access

I am trying to understand the relationship between memory coalescing on NVIDIA GPUs/CUDA and vectorized memory access on x86-SSE/C++. It is my understanding that: Memory coalescing is a run-tim...

cuda gpu cpu-architecture simd coalescing

Elissa asked 10/7, 2019 at 8:26

8

Solved

How to determine if memory is aligned?

I am new to optimizing code with SSE/SSE2 instructions and until now I have not gotten very far. To my knowledge a common SSE-optimized function would look like this: void sse_func(const float* co...

c optimization memory sse simd

Waistline asked 13/12, 2009 at 23:15

3

Solved

SIMD for float threshold operation

I would like to make some vector computation faster, and I believe that SIMD instructions for float comparison and manipulation could help. Here is the operation: void func(const double* left, con...

c++ double vectorization sse simd

Pronucleus asked 19/6, 2019 at 14:29

2

Solved

How are the gather instructions in AVX2 implemented?

Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices. What happens when the data to be loaded exists in different cache-lines? Is the instruction ...

intel ram simd avx avx2

Onetoone asked 14/2, 2014 at 8:39

2

Solved

SSE instruction to check if byte array is zeroes C#

Suppose I have a byte[] and want to check if all bytes are zeros. A for loop is the obvious way to do it, and LINQ All() is a fancy way to do it, but highest performance is critical. How can I use M...

c# arrays performance mono simd

Lettered asked 23/10, 2015 at 3:50

3

How to clear the upper 128 bits of __m256 value?

How can I clear the upper 128 bits of m2: __m256i m2 = _mm256_set1_epi32(2); __m128i m1 = _mm_set1_epi32(1); m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2)); m2 = _mm256_castsi128_si256(m...

c x86 simd avx avx2

Autogenesis asked 27/1, 2014 at 15:33

2

How to make premultiplied alpha function faster using SIMD instructions?

I'm looking for some SSE/AVX advice to optimize a routine that premultiplies RGB channel with its alpha channel: RGB * alpha / 255 (+ we keep the original alpha channel). for (int i = 0, max = wi...

c++ x86 sse simd avx

Allhallowtide asked 3/6, 2019 at 15:59

1

Solved

Is it possible to convince clang to auto-vectorize this code without using intrinsics?

Imagine I have this naive function to detect sphere overlap. The point of this question is not really to discuss the best way to do hit testing on spheres, so this is just for illustration. inline...

vectorization simd llvm-clang micro-optimization avx2

Besmirch asked 21/5, 2019 at 5:09

3

Solved

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

The code I want to optimize is basically a simple but large arithmetic formula; it should be fairly simple to analyze the code automatically to compute the independent multiplications/additions in ...

c x86 simd intrinsics sse2

Acidify asked 19/9, 2012 at 13:13

1

Solved

__m256i version of _mm_test_all_zeros

I know how to test if an __m128i register is all zero with the _mm_test_all_zeros intrinsic. What is the AVX2 / __m256i version of this intrinsic? If one isn't available, what is the fastest way to...

simd intrinsics avx avx2

Redo asked 28/5, 2019 at 16:24

1

Solved

Compare two __m128i values for total order

I need a way to compare values of type __m128i in C++ for a total order between any values of type __m128i. The type of order doesn't matter as long as it establishes a total order between all valu...

c++ x86 x86-64 simd intrinsics

Glossal asked 28/5, 2019 at 11:39

4

Solved

GCC C vector extension: How to check if result of ANY element-wise comparison is true, and which?

I am new to GCC's C vector extensions. According to the manual, the result of comparing one vector to another in the form (test = vec1 > vec2;) is that "test" contains a 0 in each element that is f...

c gcc comparison vectorization simd

Drysalter asked 23/7, 2015 at 20:20

2

Solved

Could the "reduce" function be parallelized in Functional Programming?

In Functional Programming, one benefit of the map function is that it could be implemented to be executed in parallel. So on 4-core hardware, this code and a parallel implementation of map woul...

parallel-processing functional-programming simd

Diarmit asked 6/2, 2016 at 22:09
