SIMD Questions
7
Solved
I was reading Agner Fog's optimization manuals, and I came across this example:
double data[LEN];
void compute()
{
const double A = 1.1, B = 2.2, C = 3.3;
int i;
for(i=0; i<LEN; i++) {
dat...
Mikamikado asked 19/5, 2022 at 14:39
1
Solved
I've been wondering: it's called SIMD, as in single instruction, multiple data. So why does it have single-data instructions?
For example, vaddss is the single data equivalent of the multiple data v...
Asmodeus asked 27/5, 2022 at 1:9
5
Solved
Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software?
If so, why doesn't GCC generate SIMD instructions for these library functions by default?
Also, are t...
Scorn asked 16/3, 2011 at 5:21
6
Solved
How do I use the Intel AVX vector instruction set from Java? It's a simple question but the answer seems to be hard to find.
3
Solved
I am writing some code and trying to speed it up using SIMD intrinsics SSE2/3. My code is of such nature that I need to load some data into an XMM register and act on it many times. When I'm lookin...
9
Solved
Several times now, I've encountered this term in MATLAB, Fortran ... some other ... but I've never found an explanation of what it means and what it does. So I'm asking here, what is vectorizatio...
Scenarist asked 14/9, 2009 at 15:7
3
Please tell me, I can't figure it out myself:
Here I have an __m128i SIMD vector; each of the 16 bytes contains the following value:
1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1
Is it possible to somehow transf...
5
Solved
I'm looking for an efficient (Fast) approximation of the exponential function operating on AVX elements (Single Precision Floating Point). Namely - __m256 _mm256_exp_ps( __m256 x ) without SVML.
R...
Tarpaulin asked 19/2, 2018 at 10:8
1
Solved
I'm looking for the fastest way to divide an __m256i of packed 32-bit integers by two (aka shift right by one) using AVX. I don't have access to AVX2.
As far as I know, my options are:
Drop down t...
1
For example,
https://godbolt.org/z/W5GbYxo7o
#include<cstdint>
void divTest1(int * const __restrict__ val1, int * const __restrict__ val2, int * const __restrict__ val3)
{
for(int i=0;i<...
Schizophrenia asked 2/5, 2022 at 13:39
6
Solved
I'm looking for an approximation of the natural exponential function operating on SSE element. Namely - __m128 exp( __m128 x ).
I have an implementation which is quick but seems to be very low in...
Dense asked 30/10, 2017 at 22:48
3
Solved
I have a fairly simple loop:
auto indexRecord = getRowPointer(0);
bool equals;
// recordCount is about 6 000 000
for (int i = 0; i < recordCount; ++i) {
equals = BitString::equals(SelectMask, i...
Unitarianism asked 6/4, 2022 at 20:38
2
Solved
As the title reads, if a 256-bit SIMD register is:
0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
How can I efficiently get the index of the first non-zero element (i.e. the index 2 of the first 1)? The most st...
Uraemia asked 14/10, 2016 at 0:1
3
Solved
What is the main difference between instructions that access memory marked as WB (write-back) and WC (write-combining)? What is the difference between MOVDQA and MOVNTDQA, and what is the difference between ...
2
I would like to compare two little-endian 256-bit values with A64 Neon instructions (asm) efficiently.
Equality (=)
For equality, I already got a solution:
bool eq256(const UInt256 *lhs, const...
Conspicuous asked 20/4, 2015 at 8:34
4
Consider 8 digit characters like 12345678 as a string. It can be converted to a number where every byte contains a digit like this:
const char* const str = "12345678";
const char* const b...
1
Solved
I am new to NEON intrinsics. I have two arrays containing 99 elements, which I am trying to add element-wise using NEON intrinsics. As 99 is not a multiple of 8, 16 or 32, 96 elements can be hand...
Bolin asked 11/3, 2022 at 11:8
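For the 99-element case above, one common pattern is a vector loop over full groups followed by a scalar loop for the leftover elements. The sketch below is only an illustration under assumptions: the element type is taken to be float (the teaser does not say), the function name add_arrays is hypothetical, and the NEON path is compiled only when __ARM_NEON is defined, with a plain scalar loop used otherwise.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
   Assumes float elements; the original question does not state the type. */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Vector part: 4 floats per iteration (24 iterations cover 96 of 99). */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail: the remaining n % 4 elements (or everything without NEON). */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

With n = 99 the vector loop stops at index 96 and the scalar loop handles indices 96 to 98, so no load or store touches memory past the end of the arrays.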
1
Solved
I want to achieve something like the strncmp result, but not that complicated.
I tried to read https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strcmp-avx2.S.html source code but I failed ...
0
GCC has a function attribute target_clones which can be used to create different versions of a function that are compiled to use different instruction sets in such a way that, when the binary is ex...
2
The following question is related; however, its answers are old, and a comment from user Marc Glisse suggests there are new approaches to this problem since C++17 that might not be adequately discussed.
...
Loveinidleness asked 11/2, 2020 at 13:19
2
Solved
I am trying to improve performance and have had some good experiences with SIMD. So far I have been using OMP, and I would like to improve my skills further using intrinsics.
In the following scenario, I failed ...
Charqui asked 25/1, 2022 at 18:15
5
Solved
Given a vector of three (or four) floats, what is the fastest way to sum them?
Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add instructions in SSE3 worth it?
Wh...
Emma asked 9/8, 2011 at 13:16
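For the four-float case asked about above, a frequently used SSE-only reduction needs two shuffles and two adds and avoids the SSE3 horizontal-add instructions. The sketch below is not taken from the question or its answers: the function name hsum4 is hypothetical, and a scalar fallback is used when SSE is not available at compile time. Whether it is actually the fastest option depends on the target CPU.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define HSUM4_USE_SSE 1
#endif

/* Hypothetical helper: returns p[0] + p[1] + p[2] + p[3].
   No alignment is required because an unaligned load is used. */
float hsum4(const float *p)
{
    assert(p != NULL);
#ifdef HSUM4_USE_SSE
    __m128 v    = _mm_loadu_ps(p);          /* [a, b, c, d]            */
    __m128 hi   = _mm_movehl_ps(v, v);      /* [c, d, c, d]            */
    __m128 sum2 = _mm_add_ps(v, hi);        /* [a+c, b+d, ...]         */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2,
                                 _MM_SHUFFLE(1, 1, 1, 1)); /* [b+d, ...] */
    __m128 sum1 = _mm_add_ss(sum2, odd);    /* lane 0 = (a+c) + (b+d)  */
    return _mm_cvtss_f32(sum1);
#else
    /* Scalar fallback with the same association order as the SSE path. */
    return (p[0] + p[2]) + (p[1] + p[3]);
#endif
}
```

For a three-float vector, one option under the same assumptions is to place 0.0f in the fourth lane before calling the helper.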
2
Suppose we want to quickly find the index of the first nonzero element in an array, to the effect of
fn leading_zeros(arr: &[u32]) -> Option<usize> {
arr.iter().position(|&x| x !=...
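A block-wise version of the same search can be sketched in C (chosen here only for illustration; the question itself is in Rust). The idea is to OR together a fixed-size block, which is a branch-free reduction that compilers can often vectorize, and to fall back to an element-by-element scan only for the block that contains a nonzero value. The function name first_nonzero, the block size of 8, and the convention of returning len instead of None are all assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: index of the first nonzero element,
   or len if every element is zero. */
size_t first_nonzero(const uint32_t *arr, size_t len)
{
    assert(arr != NULL || len == 0);
    size_t i = 0;
    /* Skip whole blocks of 8 zeros by OR-ing the block together. */
    for (; i + 8 <= len; i += 8) {
        uint32_t acc = 0;
        for (size_t j = 0; j < 8; ++j) {
            acc |= arr[i + j];
        }
        if (acc != 0) {
            break;  /* a nonzero value is somewhere in arr[i .. i+7] */
        }
    }
    /* Scan the block that contained a nonzero value, or the short tail. */
    for (; i < len; ++i) {
        if (arr[i] != 0) {
            return i;
        }
    }
    return len;
}
```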
4
Solved
I would like to extract the index of the highest set bit in a 256 bit AVX register with 8 bit elements. I could neither find a bsr nor a clz implementation for this.
For clz with 32 bit elements, t...
Mummify asked 30/8, 2021 at 13:32
2
I have some integer value representing a bitmask, for example 154 = 0b10011010, and I want to construct a corresponding signal Vector<T> instance <0, -1, 0, -1, -1, 0, 0, -1> (note the ...
Encyclical asked 21/12, 2021 at 13:16
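The mapping in the example above (154 = 0b10011010 giving <0, -1, 0, -1, -1, 0, 0, -1>) takes bit i of the mask, counting from the least significant bit, to lane i. A scalar reference for that mapping is sketched below in C rather than with Vector<T>; it is not a SIMD implementation, and the function name mask_to_lanes and the choice of eight 32-bit lanes are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expands the low 8 bits of mask into 8 lanes.
   Lane i is -1 (all bits set) if bit i of mask is set, otherwise 0. */
void mask_to_lanes(uint32_t mask, int32_t out[8])
{
    assert(out != NULL);
    for (int i = 0; i < 8; ++i) {
        /* (mask >> i) & 1 is 0 or 1; negating it yields 0 or -1. */
        out[i] = -(int32_t)((mask >> i) & 1u);
    }
}
```

A vectorized version typically follows the same logic: broadcast the mask to every lane, AND each lane with its own bit (1, 2, 4, ...), and compare the result for equality with that bit.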