simd Questions

2

Solved

I am currently writing a vectorized version of the QR decomposition (linear system solver) using SSE and AVX intrinsics. One of the substeps requires selecting the sign of a value opposite/equal to...
Basinet asked 10/9, 2019 at 12:32

2

Solved

I have a problem converting image data stored in a byte[] array to grayscale. I want to use vector SIMD operations because in the future I need to write ASM and C++ DLL files to measure operatio...
Jonahjonas asked 15/11, 2019 at 16:45

0

Suppose I want to load four consecutive aarch64 vector registers with values from consecutive memory locations. One way to do this is ldp q0, q1, [x0] ldp q2, q3, [x0, 32] According to the ARM opt...
Engird asked 4/7, 2020 at 22:07

2

Solved

I'm working with integers and SSE and have become very confused about how endianness affects moving data in and out of registers. My initial, wrong, understanding Initially my understanding was as ...
Disremember asked 4/6, 2014 at 18:39

1

Solved

I am trying to get started with AVX512 intrinsics by reading the Intel Intrinsics Guide but so far I have found that it does not define the named datatypes or the pseudocode syntax used for explana...
Fusiform asked 12/6, 2020 at 17:22

1

Solved

I'm trying to find a concise example that shows auto-vectorization in Java on an x86-64 system. I've implemented the code below using y[i] = y[i] + x[i] in a for loop. This code can benefit from a...
Pitching asked 13/1, 2020 at 22:51

4

Solved

I just noted that one of the first languages for the Connection-Machine of W.D. Hillis was *Lisp, an extension of Common Lisp with parallel constructs. The Connection-Machine was a massively parall...
Sempach asked 18/5, 2011 at 15:18

5

Solved

I deal with image processing. I need to divide a 16-bit integer SSE vector by 255. I can't use a shift operator like _mm_srli_epi16(), because 255 is not a power of 2. I know of course th...
Guanajuato asked 9/2, 2016 at 6:28
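
One common way to divide by a constant without a divide instruction is a multiply followed by a shift. The scalar sketch below is an illustration, not taken from the question or its answers; the function name div255_u16 is hypothetical. It assumes unsigned 16-bit inputs. In SSE2 terms, the same arithmetic should correspond to _mm_mulhi_epu16 with a vector of 0x8081 (which keeps the upper 16 bits of the product) followed by _mm_srli_epi16 by 7.

```c
#include <stdint.h>

/* Exact x / 255 for any unsigned 16-bit x, using a multiply and a shift.
   0x8081 is ceil(2^23 / 255); the 32-bit product cannot overflow
   because 65535 * 0x8081 is below 2^32. */
uint16_t div255_u16(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 0x8081u) >> 23);
}
```

The shift by 23 splits naturally into "take the high 16 bits" plus "shift right by 7", which is why the two-instruction vector form is possible.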

1

Solved

Most architectures have different sets of registers for storing regular integers and floating-point numbers. From a binary storage point of view, it shouldn't matter where things are stored, right? It's jus...
Larger asked 27/5, 2020 at 15:44

1

This question is very similar to: SIMD instructions for floating point equality comparison (with NaN == NaN) Although that question focused on 128 bit vectors and had requirements about identifyi...
Dundalk asked 24/5, 2020 at 5:21

0

I recently made some vector-code and an appropriate godbolt example. typedef float v8f __attribute__((vector_size(32))); typedef unsigned v8u __attribute__((vector_size(32))); v8f f(register v8f...
Novokuznetsk asked 23/5, 2020 at 16:54

4

Solved

I have a special question. I will try to describe this as accurately as possible. I am doing a very important "micro-optimization". A loop that runs for days at a time. So if I can cut this loop's ti...
Laurilaurianne asked 9/4, 2020 at 13:21

3

I am trying to compute the sum reduction of 32 elements (each 1 byte of data) on an Intel i3 processor. I did this: s=0; for (i=0; i<32; i++) { s = s + a[i]; } However, it's taking more time, since m...
Ericaericaceous asked 7/6, 2012 at 13:13
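
A minimal sketch of one way to vectorize this reduction, assuming the 32 elements are unsigned bytes (the excerpt does not say). It uses the SSE2 intrinsic _mm_sad_epu8 against a zero vector, which yields two 64-bit partial sums per 16-byte register. The function name sum32_u8 is hypothetical, and a plain loop is used when SSE2 is not available at compile time.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Sum 32 unsigned bytes; the maximum result is 32 * 255 = 8160. */
uint32_t sum32_u8(const uint8_t *a)
{
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_loadu_si128((const __m128i *)a);
    __m128i hi = _mm_loadu_si128((const __m128i *)(a + 16));
    /* Sum of absolute differences against zero adds up each group of
       8 bytes into the low bits of a 64-bit lane. */
    __m128i s = _mm_add_epi64(_mm_sad_epu8(lo, zero),
                              _mm_sad_epu8(hi, zero));
    /* Fold the upper 64-bit lane onto the lower one. */
    s = _mm_add_epi64(s, _mm_srli_si128(s, 8));
    return (uint32_t)_mm_cvtsi128_si32(s);
#else
    /* Portable fallback: the scalar loop from the question. */
    uint32_t s = 0;
    int i;
    for (i = 0; i < 32; i++) {
        s += a[i];
    }
    return s;
#endif
}
```

Whether this beats the scalar loop on a given machine depends on the compiler and surrounding code, so it is worth measuring rather than assuming.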

1

Solved

I've written vectorized versions of some functions that are currently the bottleneck of an algorithm, using Eigen's facilities to do so. I've also checked that AVX is enabled by making sure that E...
Frostwork asked 12/1, 2020 at 23:50

8

Solved

I have a 64-bit integer that I'm interpreting as an array of 8 packed 8-bit integers. I need to subtract the constant 1 from each packed integer while handling overflow without the...
Careaga asked 7/1, 2020 at 23:56
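
A sketch of the usual SIMD-within-a-register approach, under the assumption (the excerpt is truncated) that each byte should wrap on its own, so 0x00 becomes 0xFF and no borrow leaks into the neighbouring byte. The function name dec_each_byte is hypothetical.

```c
#include <stdint.h>

/* Subtract 1 from each of the 8 bytes packed in x, with per-byte
   wraparound and no borrow between bytes. */
uint64_t dec_each_byte(uint64_t x)
{
    const uint64_t H = 0x8080808080808080ULL; /* high bit of every byte */
    const uint64_t L = 0x0101010101010101ULL; /* 1 in every byte */
    /* Forcing each high bit on guarantees every byte is at least 0x80,
       so subtracting 1 never borrows across a byte boundary. The XOR
       then restores the correct high bit for each byte. */
    return ((x | H) - L) ^ (~x & H);
}
```

For example, the bytes 01 02 03 04 05 06 07 80 become 00 01 02 03 04 05 06 7F.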

1

Solved

What is the difference between the SIMD and multi-threading concepts that one comes across in the parallel programming paradigm?
Lindeman asked 7/1, 2020 at 6:08

2

Using C#'s Vector<T>, how can we most efficiently vectorize the operation of finding the index of a particular element in a set? As constraints, the set will always be a Span<T> of an ...
Keverne asked 9/7, 2019 at 14:59

1

In terms of SIMD and parallelization, what is the difference between AVX2 and AVX-512? Are they the same thing or different? I just see that double8 is used in AVX-512 and double4 is used for AVX2?...
Vu asked 2/12, 2019 at 20:34

0

I have an interesting problem that I can't think of an efficient way to solve with vectorized code. I have a ymm register with 8 32-bit integers, where each integer is made up of: Lower 24 bits ...
Complaisance asked 1/12, 2019 at 8:42

1

Solved

I am trying to do left rotation of a 128 bit number in AVX2. Since there is no direct method of doing this, I have tried using left shift and right shift to accomplish my task. Here is a snippet o...
Brewer asked 1/12, 2019 at 6:36

12

This is the message received from running a script to check if Tensorflow is working: I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally I te...
Blueberry asked 22/12, 2016 at 23:21

2

Solved

In a SIMD tutorial I found the following code snippet. void simd(float* a, int N) { // We assume N % 4 == 0. int nb_iters = N / 4; __m128* ptr = reinterpret_cast<__m128*>(a); // (...
Affirmatory asked 18/11, 2019 at 8:59

3

Solved

With AVX512, there is the intrinsic _mm256_lzcnt_epi32, which returns a vector that, for each of the 8 32-bit elements, contains the number of leading zero bits in the input vector's element. Is t...
Skilken asked 12/11, 2019 at 16:46

3

Solved

I can dump all the integer registers in gdb with just: info registers. For the xmm registers (Intel) I need a file like: print $xmm0 print $xmm1 ... print $xmm15 and then source that file....
Silvers asked 30/3, 2012 at 19:52

1

Solved

I've looked everywhere and I still can't figure it out. I know of two associations you can make with streams: Wrappers for backing data stores meant as an abstraction layer between consumers and ...
Nonpros asked 4/11, 2019 at 20:17
