simd - 4 - McMap

3

Solved

Visual Studio's 'watch' incorrectly shows zero for half of the numbers in a Vector<float>

Is this a bug in the VS 2017 watch, or am I doing something daft? It doesn't show half the contents of a Vector. (On my system, Vector.Count is 8). [Test] public void inspectVector() { var n...

c# visual-studio-2017 simd

Organzine asked 28/7, 2018 at 11:8

1

Solved

GEMM kernel implemented using AVX2 is faster than AVX2/FMA on a Zen 2 CPU

I have tried speeding up a toy GEMM implementation. I deal with blocks of 32x32 doubles for which I need an optimized MM kernel. I have access to AVX2 and FMA. I have two codes (in ASM, I apologies...

assembly matrix-multiplication simd avx micro-optimization

Oke asked 13/12, 2021 at 20:48

1

What's a "wavefront" in the context of real-time rendering?

I lately came across the term "wavefront" in the context of pixel shader execution on the graphics card. From context I'd assume that a wavefront is a packing of multiple pixels or vertic...

shader directx simd

Eshelman asked 6/12, 2021 at 11:7

3

Solved

Optimization of image resizing (method Nearest) using SIMD

I know that the 'Nearest' method of image resizing is the fastest one. Nevertheless, I am looking for a way to speed it up. An evident step is to precalculate the indices: void CalcIndex(int sizeS, int sizeD, int colo...

c++ image-processing simd simd-library synet

Pelagian asked 6/12, 2021 at 11:6

1

Solved

No speedup when summing uint16 vs uint64 arrays with NumPy?

I have to do a large number of operations (additions) on relatively small integers, and I started considering which datatype would give the best performance on a 64 bit machine. I was convinced tha...

python numpy performance compiler-optimization simd

Phototelegraph asked 27/11, 2021 at 10:39

3

Solved

Alignment attribute to force aligned load/store in auto-vectorization of GCC/Clang

It is known that GCC/Clang auto-vectorize loops well using SIMD instructions. It is also known that there exists a standard C++ alignas() attribute, which among other uses also allows to align stack v...

c++ performance simd avx512

Billups asked 20/11, 2021 at 12:9

5

Efficiently shift-or large bit vector

I have a large in-memory array given as a pointer uint64_t * arr (plus size), which represents plain bits. I need to very efficiently (most performant/fast) shift these bits to the right by some amount ...

c++ performance simd sse avx

Durno asked 20/11, 2021 at 7:1

2

Solved

How to vectorise int8 multiplication in C (AVX2)

How do I vectorize this C function with AVX2? static void propogate_neuron(const short a, const int8_t *b, int *c) { for (int i = 0; i < 32; ++i){ c[i] += a * b[i]; } }

c x86 simd intrinsics avx2

Arhat asked 4/11, 2021 at 23:5
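
For readers skimming this entry, the loop in the excerpt can be restated as a portable scalar reference, which is handy for checking any AVX2 version against. This is only a sketch: the name `propagate_neuron_ref` is hypothetical, and it assumes `short` is 16 bits and `int` is 32 bits, which the excerpt does not state.

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference for the loop quoted in the question (hypothetical name).
// Assumption: short == int16_t and int == int32_t on the target platform.
// Each of the 32 int8 inputs is widened to 32 bits, multiplied by `a`,
// and accumulated into the matching element of `c`.
void propagate_neuron_ref(std::int16_t a, const std::int8_t* b, std::int32_t* c) {
    assert(b != nullptr && c != nullptr);
    for (int i = 0; i < 32; ++i) {
        c[i] += static_cast<std::int32_t>(a) * static_cast<std::int32_t>(b[i]);
    }
}
```

A vectorized version has to perform the same widening before the multiply, so comparing its output element by element against this reference is a reasonable first test.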

5

Solved

SIMD prefix sum on Intel CPU

I need to implement a prefix sum algorithm and would need it to be as fast as possible. Ex: [3, 1, 7, 0, 4, 1, 6, 3] should give: [3, 4, 11, 11, 15, 16, 22, 25] Is there a way to do this usin...

c++ sse simd prefix-sum

Phototherapy asked 14/5, 2012 at 16:44
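
The input and output quoted in the excerpt describe an inclusive prefix sum. A minimal scalar sketch is shown here as a baseline; the function name `prefix_sum` and its signature are assumptions made for illustration, not code from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum: out[i] = in[0] + in[1] + ... + in[i].
// Hypothetical helper; the running total carries a dependency from one
// element to the next, which is what makes a SIMD version non-trivial.
std::vector<int> prefix_sum(const int* in, std::size_t n) {
    assert(in != nullptr || n == 0);
    std::vector<int> out(n);
    int running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        running += in[i];
        out[i] = running;
    }
    return out;
}
```

Applied to the eight values from the excerpt, this yields the sequence the asker lists, so it can serve as a correctness check for a faster implementation.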

2

Solved

How do I enable SSE4.1 and SSE3 (but NOT AVX) in MSVC

I am trying to enable different simd support using MSVC. There is a page talking about enabling some simd, such as SSE2, AVX, AVX2 https://learn.microsoft.com/en-us/cpp/build/reference/arch-x86?red...

visual-c++ sse simd sse4

Landslide asked 24/9, 2020 at 19:59

7

Solved

A better 8x8 bytes matrix transpose with SSE?

I found this post that explains how to transpose an 8x8 bytes matrix with 24 operations, and a few scrolls later there's the code that implements the transpose. However, this method does not exploi...

c matrix optimization sse simd

Jibber asked 10/2, 2017 at 14:51

5

Solved

Getting started with Intel x86 SSE SIMD instructions

I want to learn more about using the SSE. What ways are there to learn, besides the obvious reading the Intel® 64 and IA-32 Architectures Software Developer's Manuals? Mainly I'm interested to wo...

c gcc x86 sse simd

Unsnap asked 7/9, 2009 at 14:42

2

Solved

Do I need to use _mm256_zeroupper in 2021?

From Agner Fog's "Optimizing software in C++": There is a problem when mixing code compiled with and without AVX support on some Intel processors. There is a performance penalty when goi...

c++ sse simd intrinsics avx

Employment asked 11/8, 2021 at 5:40

1

Solved

Why is there no SIMD functionality in the C++ standard library?

SSE has been around since 1999 and it and its following extensions are one of the most powerful tools for improving the performance of your C++ program. Yet there is no standardized containers/algo...

c++ stl simd

Circuit asked 17/12, 2019 at 12:3

6

Solved

AVX2 what is the most efficient way to pack left based on a mask?

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2? I've seen in SSE ...

c++ vectorization sse simd avx2

Aphrodisiac asked 29/4, 2016 at 7:30

2

Solved

Do all 64-bit Intel architectures support SSSE3/SSE4.1/SSE4.2 instructions?

I searched the web and the Intel software manual, but I am unable to confirm whether all Intel 64 architectures support up to SSSE3, up to SSE4.1, up to SSE4.2, AVX, etc. So that I would be able to use ...

x86-64 intel cpu-architecture simd

Archil asked 28/1, 2015 at 6:14

2

Constexpr and SSE intrinsics

Most C++ compilers support SIMD (SSE/AVX) instructions with intrinsics like _mm_cmpeq_epi32. My problem with this is that this function is not marked as constexpr, although "semantically" there is...

c++ sse simd constexpr intrinsics

Favourable asked 16/8, 2018 at 14:59

3

Solved

How can I exchange the low 128 bits and high 128 bits in a 256 bit AVX (YMM) register

I am porting SSE SIMD code to use the 256 bit AVX extensions and cannot seem to find any instruction that will blend/shuffle/move the high 128 bits and the low 128 bits. The backing story: What...

x86 simd avx

Photodynamics asked 26/8, 2011 at 20:8

3

Solved

Writing a vector sum function with SIMD (System.Numerics) and making it faster than a for loop

I wrote a function to add up all the elements of a double[] array using SIMD (System.Numerics.Vector) and the performance is worse than the naïve method. On my computer Vector<double>.Count i...

c# arrays performance simd avx

Duvall asked 19/5, 2021 at 14:57

1

How can I get the compiler to output faster code for a string search loop, using SIMD vectorization and/or parallelization?

I have this C: #include <stddef.h> size_t findChar(unsigned int length, char* __attribute__((aligned(16))) restrict string) { for (size_t i = 0; i < length; i += 2) { if (string[i] == '[...

c assembly vectorization compiler-optimization simd

Cuprite asked 5/4, 2021 at 20:15

1

Solved

First use of AVX 256-bit vectors slows down 128-bit vector and AVX scalar ops

Originally I was trying to reproduce the effect described in Agner Fog's microarchitecture guide section "Warm-up period for YMM and ZMM vector instructions" where it says that: The proc...

assembly x86-64 sse simd avx

Triviality asked 30/3, 2021 at 15:43

3

Solved

How to convert a binary integer number to a hex string?

Given a number in a register (a binary integer), how to convert it to a string of hexadecimal ASCII digits? (i.e. serialize it into a text format.) Digits can be stored in memory or printed on the...

assembly x86 hex simd avx512

Retentivity asked 17/12, 2018 at 22:14

3

SIMD string to unsigned int parsing in C# performance improvement

I've implemented a method for parsing an unsigned integer string of length <= 8 using SIMD intrinsics available in .NET as follows: public unsafe static uint ParseUint(string text) { fixed (cha...

c# sse simd avx system.numerics

Stephen asked 25/2, 2021 at 15:35

3

Solved

Does AVX/AVX2 "exist" on each core?

So, this AVX thing - it's like a small machine for each core? Or it's just like one engine-unit for whole CPU? Like, can I use it on each core somehow? I'm playing with it, and I'm feeling like I m...

c++ cpu-architecture simd avx avx2

Matchmaker asked 20/2, 2021 at 18:25

1

_mm256_fmadd_ps is slower than _mm256_mul_ps + _mm256_add_ps?

I have an image processing algorithm to calculate a*b+c*d with AVX. The pseudo code is as follows: float *a=new float[N]; float *b=new float[N]; float *c=new float[N]; float *d=new float[N]; //ass...

gcc sse simd avx micro-optimization

Engage asked 18/2, 2021 at 13:10
