avx - McMap

1

Solved

Emulate AVX512 VPCOMPRESSB byte packing without AVX512_VBMI2

I have populated a zmm register with an array of byte integers from 0-63. The numbers serve as indices into a matrix. Non-zero elements represent rows in the matrix that contain data. Not all rows ...

x86-64 simd avx avx512

Oxazine asked 10/5, 2020 at 19:28

4

Solved

Horizontal XOR in AVX

Is there a way to XOR horizontally an AVX register—specifically, to XOR the four 64-bit components of a 256-bit register? The goal is to get the XOR of all 4 64-bit components of an AVX register. ...

c++ assembly x86 simd avx

Oviposit asked 5/7, 2017 at 21:0

1

Solved

Can std::replace implementation make redundant writes to the passed array?

std::replace implementation can be optimized using vectorization (by specializing the library implementation or by the compiler). The vectorized implementation would compare and replace several ele...

c++ language-lawyer vectorization sse avx

Danille asked 2/3, 2024 at 10:39

4

Matrix transpose and population count

I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column. For instance for n=4: 1101 0101 0001 1001 M stored as { { 1,1,0,1}, {0...

bit-manipulation transpose simd avx bitcount

Laaspere asked 23/7, 2018 at 9:37

2

Solved

Can AVX2 be used to implement faster LZCNT processing on a word array?

I need to bit-scan-reverse an array of 16-bit words with LZCNT. The throughput of LZCNT is 1 execution per clock on the latest generation of Intel processors. The throughput on an AMD Ryzen seems to...

x86 simd avx micro-optimization avx2

Milan asked 15/5, 2019 at 15:43

4

Solved

Fastest way to mask out bytes higher than separator position with SIMD

uint8_t data[] = "mykeyxyz:1234\nky:123\n..."; Each line of the string has the format key:value, with len(key) <= 16 guaranteed. I want to load mykeyxyz into a __m128i, but fill...

c++ assembly optimization simd avx

Laird asked 11/1, 2024 at 14:59

1

Looking for an efficient function to find an index of max element in SIMD vector using a library

There are similar older questions, but they are using intrinsics and old instruction sets. I have a function f written with C++ vector class library (https://www.agner.org/optimize/#vectorclass): i...

c++ optimization simd avx vector-class-library

Cocainism asked 23/12, 2023 at 9:46

4

Solved

Packing and de-interleaving two __m256 registers

I have a row-wise array of floats (~20 cols x ~1M rows) from which I need to extract two columns at a time into two __m256 registers. ...a0.........b0...... ...a1.........b1...... // ... ...a7.......

c++ x86 simd avx avx2

Marolda asked 27/2, 2017 at 23:58

1

Solved

How does SIMD (AVX) processing work? For example, how do I fit ten 32-bit floats into a 256-bit AVX vector?

I am learning about C avx intrinsics and I am wondering how this works. I am familiar that I can do something like this: __m256 evens = _mm256_set_ps(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0); H...

c simd avx

Paige asked 29/11, 2023 at 20:50

1

What's the difference between the XOR instructions "VPXORD", "VXORPS" and "VXORPD" in Intel's AVX2

I see in AVX2 instruction set, Intel distinguishes the XOR operations of integer, double and float with different instructions. For Integer there's "VPXORD", and for double "VXORPD", for float "VXO...

x86 cpu-architecture avx avx2 avx512

Halfmoon asked 5/3, 2019 at 18:32

4

Solved

How to efficiently perform double/int64 conversions with SSE/AVX?

SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers. _mm_cvtps_epi32() _mm_cvtepi32_ps() But there are no equivalents for double-precision and 64-bi...

c++ floating-point sse simd avx

Apotheosize asked 14/12, 2016 at 14:9

0

Why does MinGW GCC use x87 80bit FP library code for atan2, cos, exp & sin?

I have a curious problem porting working numerical code from Intel 2023 & MSC Visual C++ 2022. The code compiled with GCC is perfectly accurate (too accurate) since some library calls are worki...

gcc floating-point mingw avx x87

Circumpolar asked 25/10, 2023 at 16:36

2

Solved

sse/avx equivalent for neon vuzp

Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, it does this: i...

sse simd neon avx

Womenfolk asked 28/7, 2017 at 14:36

3

Solved

Why is gcc so much worse at std::vector<float> vectorization of a conditional multiply than clang?

Consider the following float loop, compiled using -O3 -mavx2 -mfma: for (auto i = 0; i < a.size(); ++i) { a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0; } Clang did a perfect job of vectorizing it. It...

c++ gcc vectorization compiler-optimization avx

Desiccant asked 13/7, 2023 at 23:17

5

Solved

How to get data out of AVX registers?

Using MSVC 2013 and AVX 1, I've got 8 floats in a register: __m256 foo = _mm256_fmadd_ps(a,b,c); Now I want to call inline void print(float) {...} for all 8 floats. It looks like the Intel AVX in...

c++ visual-c++ avx fma

Anticlimax asked 3/6, 2016 at 10:51

2

Solved

Why does gcc auto-vectorization for Tiger Lake use ymm, not zmm, registers?

I wanted to explore auto-vectorization by gcc (10.3). I have the following short program (see https://godbolt.org/z/5v9a53aj6) which computes the sum of all elements of a vector: #include <stdio...

c gcc avx avx512 auto-vectorization

Tobacco asked 21/10, 2022 at 10:12

2

Solved

What's the fastest way to perform an arbitrary 128/256/512 bit permutation using SIMD instructions?

I want to perform an arbitrary permutation of single bits, pairs of bits, and nibbles (4 bits) on a CPU register (xmm, ymm or zmm) of width 128, 256 or 512 bits; this should be as fast as possible....

c++ assembly sse avx avx2

Grasmere asked 28/1, 2019 at 19:9

1

Solved

L1 Cache Usage in Optimised matrix multiplication micro-kernel in C++

I was tasked with implementing an optimised matrix multiplication micro-kernel that computes C = A*B in C++ starting from the following snippet of code. I am getting some counter intuitive behaviou...

c++ optimization matrix-multiplication avx cpu-cache

Owing asked 4/3, 2023 at 17:52

1

C# and SIMD: High and low speedups. What is happening?

Introduction of the problem I am trying to speed up the intersection code of a (2d) ray tracer that I am writing. I am using C# and the System.Numerics library to bring the speed of SIMD instructio...

c# performance x86-64 simd avx

Hogan asked 9/7, 2019 at 11:42

3

Solved

Horizontal minimum and maximum using SSE

I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the time. I have been using t...

c++ max sse minimum avx

Garrotte asked 7/3, 2014 at 17:17

3

Solved

Get sum of values stored in __m256d with SSE/AVX

Is there a way to get sum of values stored in __m256d variable? I have this code. acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec)); //acc in this point contains {2.0, 8.0, 18.0, 32.0} acc = _mm25...

c++ optimization sse avx avx2

Hendecahedron asked 20/4, 2018 at 12:27

3

How to enable AVX / AVX2 in VirtualBox 6.1.16 with Ubuntu 20.04 64bit?

TL;DR: Tensorflow 1.15 crashes on my virtual machine when imported by Python (error message is Illegal instruction (core dumped)), very probably thanks to AVX and AVX2 being disabled on it. My hos...

tensorflow virtual-machine virtualbox avx avx2

Abiogenetic asked 18/1, 2021 at 18:56

7

Solved

How to check if a CPU supports the SSE3 instruction set?

Is the following code valid to check if a CPU supports the SSE3 instruction set? Using the IsProcessorFeaturePresent() function apparently does not work on Windows XP. bool CheckSSE3() { int CPUIn...

c++ sse instruction-set avx cpuid

Yuk asked 25/5, 2011 at 8:49

5

Solved

How to tell if a Linux machine supports AVX/AVX2 instructions?

I'm on SUSE Linux Enterprise 10/11 machines. I launch my regressions to a farm of machines running Intel processors. Some of my tests fail because my tools are built using a library which requires ...

linux unix avx suse avx2

Kanchenjunga asked 27/5, 2016 at 9:40

1

Solved

How do you handle indivisible vector lengths with SIMD intrinsics, array not a multiple of vector width?

I am currently learning how to work with SIMD intrinsics. I know that an AVX 256-bit vector can contain four doubles, eight floats, or eight 32-bit integers. How do we use AVX to process arrays tha...

c++ vectorization simd intrinsics avx

Meingoldas asked 16/9, 2022 at 3:18
