simd - 6 - McMap

2

Solved

Writing a portable SSE/AVX version of std::copysign

I am currently writing a vectorized version of the QR decomposition (linear system solver) using SSE and AVX intrinsics. One of the substeps requires selecting the sign of a value opposite/equal to...

c++ x86-64 sse simd avx

Basinet asked 10/9, 2019 at 12:32
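
For the `std::copysign` question above, one common bit-mask approach can be sketched as follows. This is a minimal sketch for the 128-bit SSE case only; the helper name `copysign_ps` is hypothetical, and the excerpt is truncated, so it may not match exactly what the asker needs.

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper: per lane, returns the magnitude of `mag`
// combined with the sign bit of `sgn` (like std::copysign).
inline __m128 copysign_ps(__m128 mag, __m128 sgn) {
    // -0.0f has only the sign bit set in each 32-bit lane.
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    // Take the sign bit from sgn, everything else from mag.
    return _mm_or_ps(_mm_and_ps(sign_mask, sgn),
                     _mm_andnot_ps(sign_mask, mag));
}
```

The opposite sign can be obtained by XOR-ing the result with the same sign mask. An AVX variant would follow the same pattern with the `_mm256_*_ps` counterparts.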

2

Solved

How to convert byte array of image pixels data to grayscale using vector SSE operation

I have a problem with converting the image data stored in a byte[] array to grayscale. I want to use vector SIMD operations because in the future I need to write ASM and C++ DLL files to measure operatio...

c# image-processing vectorization sse simd

Jonahjonas asked 15/11, 2019 at 16:45

0

arm64 assembly: LDP vs. LD4 execution time

Suppose I want to load four consecutive aarch64 vector registers with values from consecutive memory locations. One way to do this is ldp q0, q1, [x0] ldp q2, q3, [x0, 32] According to the ARM opt...

performance assembly arm simd arm64

Engird asked 4/7, 2020 at 22:07

2

Solved

How does endianness work with SIMD registers?

I'm working with integers and SSE and have become very confused about how endianness affects moving data in and out of registers. My initial, wrong, understanding Initially my understanding was as ...

x86 sse endianness simd

Disremember asked 4/6, 2014 at 18:39

1

Solved

How to read the "Intel Intrinsics Guide"?

I am trying to get started with AVX512 intrinsics by reading the Intel Intrinsics Guide but so far I have found that it does not define the named datatypes or the pseudocode syntax used for explana...

intel simd intrinsics

Fusiform asked 12/6, 2020 at 17:22

1

Solved

Java auto vectorization example

I'm trying to find a concise example which shows auto vectorization in Java on an x86-64 system. I've implemented the below code using y[i] = y[i] + x[i] in a for loop. This code can benefit from a...

java assembly vectorization x86-64 simd

Pitching asked 13/1, 2020 at 22:51

4

Solved

Any Lisp extensions for CUDA?

I just noted that one of the first languages for the Connection-Machine of W.D. Hillis was *Lisp, an extension of Common Lisp with parallel constructs. The Connection-Machine was a massively parall...

lisp cuda parallel-processing gpgpu simd

Sempach asked 18/5, 2011 at 15:18

5

Solved

How to divide 16-bit integer by 255 with using SSE?

I deal with image processing. I need to divide a 16-bit integer SSE vector by 255. I can't use a shift operator like _mm_srli_epi16(), because 255 is not a power of 2. I know of course th...

c++ image-processing sse simd sse2

Guanajuato asked 9/2, 2016 at 6:28
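
For the divide-by-255 question above, one known technique is to replace the division with a multiply-high and a shift. The sketch below assumes unsigned 16-bit lanes and SSE2; the helper name `div255_epu16` is hypothetical. It relies on the identity floor(x / 255) == (x * 0x8081) >> 23 for every x in 0..65535.

```cpp
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: divides each unsigned 16-bit lane by 255 (truncating).
inline __m128i div255_epu16(__m128i x) {
    // 0x8081 = 32897; the cast only reinterprets the bit pattern.
    const __m128i magic = _mm_set1_epi16(static_cast<short>(0x8081));
    // mulhi keeps the upper 16 bits of x * magic, i.e. (x * magic) >> 16;
    // shifting right by 7 more gives (x * magic) >> 23.
    return _mm_srli_epi16(_mm_mulhi_epu16(x, magic), 7);
}
```

Because the intermediate product is taken as a full 32-bit value by `_mm_mulhi_epu16`, no widening of the input lanes is needed.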

1

Solved

Why floating point registers are different than general purpose ones

Most architectures have a different set of registers for storing regular integers and floating points. From a binary storage point of view, it shouldn't matter where things are stored, right? It's jus...

floating-point x86-64 simd cpu-registers

Larger asked 27/5, 2020 at 15:44

1

find nan in array of doubles using simd

This question is very similar to: SIMD instructions for floating point equality comparison (with NaN == NaN) Although that question focused on 128 bit vectors and had requirements about identifyi...

c nan sse simd avx

Dundalk asked 24/5, 2020 at 5:21

0

gcc optimization better at -O0 than -O3

I recently made some vector-code and an appropriate godbolt example. typedef float v8f __attribute__((vector_size(32))); typedef unsigned v8u __attribute__((vector_size(32))); v8f f(register v8f...

gcc compiler-optimization simd avx2

Novokuznetsk asked 23/5, 2020 at 16:54

4

Solved

Micro Optimization of a 4-bucket histogram of a large array or list

I have a special question. I will try to describe this as accurately as possible. I am doing a very important "micro-optimization". A loop that runs for days at a time. So if I can cut this loops ti...

c# optimization histogram simd micro-optimization

Laurilaurianne asked 9/4, 2020 at 13:21

3

Sum reduction of unsigned bytes without overflow, using SSE2 on Intel

I am trying to find the sum reduction of 32 elements (each 1 byte of data) on an Intel i3 processor. I did this: s=0; for (i=0; i<32; i++) { s = s + a[i]; } However, it's taking more time, since m...

x86 sse simd sse2 sse3

Ericaericaceous asked 7/6, 2012 at 13:13
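
For the byte sum question above, a common SSE2 idiom is to use `_mm_sad_epu8` (psadbw) against a zero vector, which adds groups of 8 unsigned bytes into 64-bit lanes and so cannot overflow. This is a minimal sketch assuming exactly 32 input bytes; the helper name `sum32_bytes` is hypothetical.

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: sums 32 unsigned bytes starting at `a`.
inline std::uint32_t sum32_bytes(const std::uint8_t* a) {
    const __m128i zero = _mm_setzero_si128();
    // Unaligned loads, so `a` needs no particular alignment.
    __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + 16));
    // Sum of absolute differences against zero: each 64-bit half
    // receives the sum of its 8 bytes.
    __m128i s = _mm_add_epi64(_mm_sad_epu8(lo, zero), _mm_sad_epu8(hi, zero));
    // Fold the upper 64-bit half onto the lower one.
    s = _mm_add_epi64(s, _mm_srli_si128(s, 8));
    // The total is at most 32 * 255 = 8160, so the low 32 bits suffice.
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(s));
}
```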

1

Solved

Ensuring that Eigen uses AVX vectorization for a certain operation

I've written vectorized versions of some functions that are currently the bottleneck of an algorithm, using Eigen's facilities to do so. I've also checked that AVX is enabled by making sure that E...

c++ vectorization eigen simd avx

Frostwork asked 12/1, 2020 at 23:50

8

Solved

Subtracting packed 8-bit integers in an 64-bit integer by 1 in parallel, SWAR without hardware SIMD

If I have a 64-bit integer that I'm interpreting as an array of packed 8-bit integers with 8 elements. I need to subtract the constant 1 from each packed integer while handling overflow without the...

c++ c bit-manipulation simd swar

Careaga asked 7/1, 2020 at 23:56
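
For the SWAR question above, one well-known trick is to set the high bit of every byte before subtracting so that no borrow can cross a byte boundary, then repair the high bits afterwards. The sketch below assumes that each byte should wrap around independently (0x00 becomes 0xFF); the excerpt is truncated, so that requirement is an assumption, and the name `sub1_each_byte` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: subtracts 1 from each of the 8 packed bytes in x,
// with each byte wrapping on its own (no borrow into the neighbouring byte).
constexpr std::uint64_t sub1_each_byte(std::uint64_t x) {
    constexpr std::uint64_t H = 0x8080808080808080ULL; // high bit of each byte
    constexpr std::uint64_t L = 0x0101010101010101ULL; // 1 in each byte
    // With the high bit forced on, every byte is at least 0x80, so
    // subtracting 1 per byte never borrows from the next byte.
    // The XOR then restores the correct high bit for each byte.
    return ((x | H) - L) ^ ((x ^ H) & H);
}
```

For example, the bytes 0x00, 0x01, 0x7F, 0x80 become 0xFF, 0x00, 0x7E, 0x7F.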

1

Solved

Difference between SIMD and Multi-threading [closed]

What is the difference between the SIMD and Multi-threading concepts that one comes across in the parallel programming paradigm?

multithreading parallel-processing simd

Lindeman asked 7/1, 2020 at 6:08

2

Use C# Vector<T> SIMD to find index of matching element

Using C#'s Vector<T>, how can we most efficiently vectorize the operation of finding the index of a particular element in a set? As constraints, the set will always be a Span<T> of an ...

c# vectorization simd intrinsics dot-product

Keverne asked 9/7, 2019 at 14:59

1

What is the difference between AVX2 and AVX-512?

In terms of SIMD and parallelization, what is the difference between AVX2 and AVX-512? Are they the same thing or different? I just see that double8 is used in AVX-512 and double4 is used for AVX2?...

opencl simd avx avx2 avx512

Vu asked 2/12, 2019 at 20:34

0

Efficient Way of shuffling 3 bit values inside an AVX2/ymm register

I have an interesting problem that I can't think of an efficient way to solve with vectorized code. I have a ymm register with 8 32-bit integers, where each integer is made up of: Lower 24 bits ...

c sse simd avx avx2

Complaisance asked 1/12, 2019 at 8:42

1

Solved

left shift of 128 bit number using AVX2 instruction

I am trying to do left rotation of a 128 bit number in AVX2. Since there is no direct method of doing this, I have tried using left shift and right shift to accomplish my task. Here is a snippet o...

c++ simd intrinsics avx avx2

Brewer asked 1/12, 2019 at 6:36

12

How to compile Tensorflow with SSE4.2 and AVX instructions?

This is the message received from running a script to check if Tensorflow is working: I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally I te...

tensorflow x86 compiler-optimization simd compiler-options

Blueberry asked 22/12, 2016 at 23:21

2

Solved

Is casting to simd-type undefined behaviour in C++? [duplicate]

In a SIMD tutorial I found the following code snippet. void simd(float* a, int N) { // We assume N % 4 == 0. int nb_iters = N / 4; __m128* ptr = reinterpret_cast<__m128*>(a); // (...

c++ sse undefined-behavior simd intrinsics

Affirmatory asked 18/11, 2019 at 8:59

3

Solved

Count leading zero bits for each element in AVX2 vector, emulate _mm256_lzcnt_epi32

With AVX512, there is the intrinsic _mm256_lzcnt_epi32, which returns a vector that, for each of the 8 32-bit elements, contains the number of leading zero bits in the input vector's element. Is t...

bit-manipulation simd avx avx2 avx512

Skilken asked 12/11, 2019 at 16:46

3

Solved

How to dump all the XMM registers in gdb?

I can dump all the integer registers in gdb with just: info registers. For the xmm registers (Intel) I need a file like: print $xmm0 print $xmm1 ... print $xmm15 and then source that file....

x86 gdb simd sse cpu-registers

Silvers asked 30/3, 2012 at 19:52

1

Solved

What does the Streaming stand for in Streaming SIMD Extensions (SSE)?

I've looked everywhere and I still can't figure it out. I know of two associations you can make with streams: Wrappers for backing data stores meant as an abstraction layer between consumers and ...

stream multiprocessing sse simd instruction-set

Nonpros asked 4/11, 2019 at 20:17
