simd Questions
1
Solved
I am looking for the fastest C# / .NET Core method capable of filling a Span<int> with the enumeration 0, 1, 2, 3, ... The naive for loop - see below - is already plenty fast, but there is pr...
4
Solved
I've got some code that works with __m128 values. I'm using x86-64 SSE intrinsics on these values and I find that if the values are unaligned in memory I get a crash. This is due to my compiler (cl...
Wealth asked 24/11, 2015 at 9:4
1
Solved
One of the purposes of the C1 bit in the x87 FPU status word is to show whether or not an inexact result was rounded up.
Does SSE/AVX provide any such indication for scalar operations?
I did no...
1
Solved
For a project I'm working on I need to count the number of set bits per column in ripped PDF image data.
I'm trying to get the total set bit count for each column in the entire PDF job (all pages)...
Steinberg asked 21/10, 2019 at 12:18
2
Solved
Let's say I have some 16-byte-aligned structure that just wraps a 3 x Float32 array:
#[repr(C, align(16))]
pub struct Vector(pub [f32; 3]);
Now I want to divide two instances of it, like this:
us...
Quadriplegia asked 8/10, 2019 at 6:39
3
Solved
How does one efficiently perform horizontal addition with floats in a 512-bit AVX register (i.e., add the items from a single vector together)? For 128 and 256 bit registers this can be done using _mm...
Hardej asked 12/11, 2014 at 20:58
3
I am given an array of lowercase characters (up to 1.5Gb) and a character c, and I want to count how many times c occurs, using AVX instructions.
unsigned long long char_count_AVX...
2
Solved
I have AVX (no AVX2 or AVX-512). I have a vector of 32-bit values (only the 4 lowest bits are used, the rest are always zero):
[ 1010, 0000, 0000, 0000, 0000, 1010, 1010, 0000]
Internally, I keep vector...
1
Solved
In x86-64 SIMD instruction names, as well as the intrinsic functions you can use to access them from C/C++, you find both the terms shuffle (e.g., _mm_shuffle_epi32) and permute (e.g., _mm_permute_...
5
Solved
(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probabl...
Bamby asked 9/3, 2019 at 20:13
3
Solved
I'm currently coding an application in C# which could benefit a great deal from using SSE, as a relatively small piece of code causes 90-95% of the execution time. The code itself is also perfect for...
3
Solved
I have a section of code which is a bottleneck in a C++ application running on x86 processors, where we take double values from two arrays, cast to float and store in an array of structs. The reaso...
Bindweed asked 12/7, 2019 at 20:22
2
Solved
I am trying to understand the relationship between memory coalescing on NVIDIA GPUs/CUDA and vectorized memory access on x86-SSE/C++.
It is my understanding that:
Memory coalescing is a run-tim...
Elissa asked 10/7, 2019 at 8:26
8
Solved
I am new to optimizing code with SSE/SSE2 instructions and until now I have not gotten very far. To my knowledge a common SSE-optimized function would look like this:
void sse_func(const float* co...
Waistline asked 13/12, 2009 at 23:15
3
Solved
I would like to make some vector computation faster, and I believe that SIMD instructions for float comparison and manipulation could help. Here is the operation:
void func(const double* left, con...
Pronucleus asked 19/6, 2019 at 14:29
2
Solved
Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices.
What happens when the data to be loaded exists in different cache-lines? Is the instruction ...
2
Solved
Suppose I have a byte[] and want to check if all bytes are zeros. A for loop is the obvious way to do it, and LINQ All() is a fancy way to do it, but the highest performance is critical.
How can I use M...
Lettered asked 23/10, 2015 at 3:50
3
How can I clear the upper 128 bits of m2:
__m256i m2 = _mm256_set1_epi32(2);
__m128i m1 = _mm_set1_epi32(1);
m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
m2 = _mm256_castsi128_si256(m...
2
I'm looking for some SSE/AVX advice to optimize a routine that premultiplies the RGB channels by the alpha channel: RGB * alpha / 255 (+ we keep the original alpha channel).
for (int i = 0, max = wi...
1
Solved
Imagine I have this naive function to detect sphere overlap. The point of this question is not really to discuss the best way to do hit testing on spheres, so this is just for illustration.
inline...
Besmirch asked 21/5, 2019 at 5:9
3
Solved
The code I want to optimize is basically a simple but large arithmetic formula; it should be fairly simple to analyze the code automatically to compute the independent multiplications/additions in ...
Acidify asked 19/9, 2012 at 13:13
1
Solved
I know how to test if an __m128i register is all zero with the _mm_test_all_zeros intrinsic.
What is the AVX2 / __m256i version of this intrinsic? If one isn't available, what is the fastest way to...
Redo asked 28/5, 2019 at 16:24
1
Solved
I need a way to compare values of type __m128i in C++ for a total order between any values of type __m128i. The type of order doesn't matter as long as it establishes a total order between all valu...
Glossal asked 28/5, 2019 at 11:39
4
Solved
I am new to GCC's C vector extensions. According to the manual, the result of comparing one vector to another in the form (test = vec1 > vec2;) is that "test" contains a 0 in each element that is f...
Drysalter asked 23/7, 2015 at 20:20
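As a rough illustration of the vector comparison described in the excerpt above, here is a minimal sketch using the GCC/Clang `vector_size` extension. The type name `v4si` and the helpers `greater_mask` and `select_max` are hypothetical names chosen for this example and do not come from the question. It assumes the documented behaviour that each element of a comparison result is 0 where the test is false and -1 (all bits set) where it is true, which is what makes the result usable as a bit mask.

```c
/* Four 32-bit ints packed into one 16-byte vector (GCC/Clang extension).
   The name v4si is an assumption made for this sketch. */
typedef int v4si __attribute__((vector_size(16)));

/* Element-wise a > b. Each element of the result is 0 where the test is
   false and -1 (all bits set) where it is true. */
static v4si greater_mask(v4si a, v4si b) {
    return a > b;
}

/* Element-wise maximum built from the mask: keep a where the mask is set,
   b everywhere else. Hypothetical helper, for illustration only. */
static v4si select_max(v4si a, v4si b) {
    v4si mask = greater_mask(a, b);
    return (a & mask) | (b & ~mask);
}
```

Calling `greater_mask` on `{1, 5, 3, 7}` and `{4, 2, 3, 6}` should give `{0, -1, 0, -1}`, and `select_max` on the same inputs should give `{4, 5, 3, 7}`. Because true is encoded as all bits set, the mask can be combined with `&`, `|` and `~` directly, with no per-element branch.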
2
Solved
In Functional Programming, one benefit of the map function is that it could be implemented to be executed in parallel.
So on 4-core hardware, this code and a parallel implementation of map woul...
Diarmit asked 6/2, 2016 at 22:9