sse - McMap

1

Solved

Why do modern compilers prefer SSE over FPU for single floating-point operations

I recently tried to read the assembly of my compiled binary and found that a lot of floating-point operations are done using XMM registers and SSE instructions. For example, the following code: fl...

c assembly floating-point sse x87

Deposit asked 11/9, 2024 at 11:52

3

Solved

Push XMM register to the stack

Is there a way of pushing a packed doubleword integer from XMM register to the stack? and then later on pop it back when needed? Ideally I am looking for something like PUSH or POP for general pur...

assembly x86 simd sse

Footcandle asked 15/4, 2012 at 12:13

0

Why does removing instructions from my SSE intrinsic function make it slower?

Please note that this question is not about YUV422 to RGB conversion! I have this code for a pixel order YUV422 to RGB conversion. static void yuv422ToRGB(unsigned char* img, int width, int height...

c++ optimization clang sse

Cocotte asked 20/6, 2024 at 15:03

2

Solved

Divide 8-bit integers by 4 (or shift) using SSE

How can I divide 16 8-bit integers by 4 (or shift them 2 to the right) using SSE intrinsics?

c++ x86 sse simd intrinsics

Betimes asked 9/1, 2017 at 19:32
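Here is a minimal SSE2 sketch. It assumes the 16 bytes are unsigned, because for signed bytes an arithmetic shift is not the same as division for negative values. SSE2 has no 8-bit shift, so the sketch shifts 16-bit lanes and then masks off the two bits that leak in from the neighbouring byte. The function name `div4_epu8` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: divide 16 unsigned 8-bit lanes by 4 (logical >> 2).
   _mm_srli_epi16 shifts whole 16-bit lanes, so the top 2 bits of each low
   byte receive bits from the high byte; the 0x3F mask clears them. */
static inline __m128i div4_epu8(__m128i v)
{
    __m128i shifted = _mm_srli_epi16(v, 2);
    return _mm_and_si128(shifted, _mm_set1_epi8(0x3F));
}
```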

4

Solved

Improve SSE (SSSE3) YUV to RGB code

I am looking to optimise some SSE code I wrote for converting YUV to RGB (both planar and packed YUV functions). I am using SSSE3 at the moment, but if there are useful functions from later SSE ve...

optimization assembly rgb sse yuv

Arman asked 31/12, 2010 at 22:20

1

Solved

Can std::replace implementation make redundant writes to the passed array?

std::replace implementation can be optimized using vectorization (by specializing the library implementation or by the compiler). The vectorized implementation would compare and replace several ele...

c++ language-lawyer vectorization sse avx

Danille asked 2/3, 2024 at 10:39

4

Solved

What is the meaning of "non temporal" memory accesses in x86

This is a somewhat low-level question. In x86 assembly there are two SSE instructions: MOVDQA xmmi, m128 and MOVNTDQA xmmi, m128 The IA-32 Software Developer's Manual says that the NT i...

x86 sse assembly

Millisent asked 31/8, 2008 at 20:18
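To illustrate the store side of non-temporal access, here is a small sketch that uses the SSE2 streaming store `_mm_stream_si128` (MOVNTDQ) rather than the SSE4.1 MOVNTDQA load mentioned in the question. The store is weakly ordered and goes through write-combining buffers, which is why the sketch ends with `_mm_sfence`. The function name `stream_fill_u32` is hypothetical, and the sketch assumes `dst` is 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill dst[0..count) with value using non-temporal
   stores. Assumes dst is 16-byte aligned (required by MOVNTDQ). */
static void stream_fill_u32(uint32_t *dst, size_t count, uint32_t value)
{
    __m128i v = _mm_set1_epi32((int)value);
    size_t i = 0;
    for (; i + 4 <= count; i += 4)
        _mm_stream_si128((__m128i *)(dst + i), v);  /* bypasses the cache where possible */
    for (; i < count; i++)                          /* scalar tail */
        dst[i] = value;
    _mm_sfence();  /* order the weakly-ordered NT stores before later stores */
}
```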

6

Solved

Benefits of x87 over SSE

I know that x87 has higher internal precision, which is probably the biggest difference that people see between it and SSE operations. But I have to wonder, is there any other benefit to using x87?...

x86 x86-64 sse fpu x87

Multipara asked 4/12, 2009 at 3:33

1

Solved

How to vectorize a vector-matrix product with SSE?

I have this function in C++ void routine2(float alpha, float beta) { unsigned int i, j; for (i = 0; i < N; i++) for (j = 0; j < N; j++) w[i] = w[i] - beta + alpha * A[i][j] * x[j]; } ...

c++ matrix-multiplication simd sse dot-product

Hebetate asked 19/12, 2023 at 21:04
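The quoted `routine2` applies N updates to each `w[i]`, so it is algebraically `w[i] += alpha * dot(A[i], x) - N * beta`. That turns the inner loop into a dot product that SSE can vectorize. The sketch below relies on that rewrite. It assumes a hypothetical `N = 8` (any multiple of 4 works) and global `w`, `x`, `A` arrays shaped like the question's. Results can differ from the scalar loop in the last bits, because floating-point addition is reassociated.

```c
#include <xmmintrin.h>  /* SSE */

#define N 8  /* hypothetical size; this sketch assumes N % 4 == 0 */

float w[N], x[N], A[N][N];

/* Same as routine2 up to floating-point reassociation:
   w[i] = w[i] + alpha * dot(A[i], x) - N * beta. */
void routine2_sse(float alpha, float beta)
{
    for (int i = 0; i < N; i++) {
        __m128 acc = _mm_setzero_ps();
        for (int j = 0; j < N; j += 4) {
            __m128 a  = _mm_loadu_ps(&A[i][j]);
            __m128 xv = _mm_loadu_ps(&x[j]);
            acc = _mm_add_ps(acc, _mm_mul_ps(a, xv));  /* four partial sums */
        }
        /* horizontal sum of the four partial dot products */
        __m128 shuf = _mm_movehl_ps(acc, acc);          /* {a2, a3, a2, a3} */
        __m128 sums = _mm_add_ps(acc, shuf);            /* {a0+a2, a1+a3, ...} */
        shuf = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(1, 1, 1, 1));
        sums = _mm_add_ss(sums, shuf);                  /* lane 0 = full sum */
        float dot = _mm_cvtss_f32(sums);
        w[i] = w[i] + alpha * dot - (float)N * beta;
    }
}
```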

1

Best way to do a packed 16 element blend using SSE

I would like to implement the following function using SSE. It blends elements from a with packed elements from b, where elements are only present if they are used. void packedBlend16(uint8_t mask...

assembly x86 x86-64 intel sse

Cherie asked 16/5, 2020 at 19:52

4

Solved

How to efficiently perform double/int64 conversions with SSE/AVX?

SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers. _mm_cvtps_epi32() _mm_cvtepi32_ps() But there are no equivalents for double-precision and 64-bi...

c++ floating-point sse simd avx

Apotheosize asked 14/12, 2016 at 14:09
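A widely used SSE2-only workaround adds the magic constant 1.5 * 2^52, whose bit pattern is 0x4338000000000000. For any double in [-2^51, 2^51], the sum has a fixed exponent, so the integer value sits directly in the low mantissa bits. Integer subtraction of the constant's bit pattern then recovers it, and the reverse direction works the same way. This is a sketch valid only for that range. `double_to_int64` also rounds according to the current rounding mode (round-to-nearest by default) instead of truncating like a C cast. The function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* 1.5 * 2^52; only inputs in [-2^51, 2^51] convert correctly. */
#define MAGIC 6755399441055744.0

/* Hypothetical helper: two doubles -> two int64 (rounded, not truncated). */
static inline __m128i double_to_int64(__m128d x)
{
    x = _mm_add_pd(x, _mm_set1_pd(MAGIC));
    return _mm_sub_epi64(_mm_castpd_si128(x),
                         _mm_castpd_si128(_mm_set1_pd(MAGIC)));
}

/* Hypothetical helper: two int64 -> two doubles (exact within the range). */
static inline __m128d int64_to_double(__m128i x)
{
    x = _mm_add_epi64(x, _mm_castpd_si128(_mm_set1_pd(MAGIC)));
    return _mm_sub_pd(_mm_castsi128_pd(x), _mm_set1_pd(MAGIC));
}
```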

2

Solved

sse/avx equivalent for neon vuzp

Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, it does this: i...

sse simd neon avx

Womenfolk asked 28/7, 2017 at 14:36
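For 4-float vectors, a NEON-style unzip (even lanes of both inputs, then odd lanes) maps onto two `_mm_shuffle_ps` calls. The shuffle takes its low two result lanes from the first operand and its high two from the second. This is a sketch for the 32-bit float case only, and the helper name `uzp_ps` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE */

/* Hypothetical helper emulating a vuzp-style deinterleave of 4 floats:
   even = {a0, a2, b0, b2}, odd = {a1, a3, b1, b3}. */
static inline void uzp_ps(__m128 a, __m128 b, __m128 *even, __m128 *odd)
{
    *even = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0));
    *odd  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1));
}
```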

4

Solved

How to use align-data-move SSE in Delphi XE3?

I was trying to run the following, type Vector = array [1..4] of Single; {$CODEALIGN 16} function add4(const a, b: Vector): Vector; register; assembler; asm movaps xmm0, [a] movaps xmm1, [b] ...

delphi assembly sse basm

Coulter asked 4/4, 2013 at 1:57

2

Solved

What's the fastest way to perform an arbitrary 128/256/512 bit permutation using SIMD instructions?

I want to perform an arbitrary permutation of single bits, pairs of bits, and nibbles (4 bits) on a CPU register (xmm, ymm or zmm) of width 128, 256 or 512 bits; this should be as fast as possible....

c++ assembly sse avx avx2

Grasmere asked 28/1, 2019 at 19:09

4

Solved

Can counting byte matches between two strings be optimized using SIMD?

Profiling suggests that this function here is a real bottleneck for my application: static inline int countEqualChars(const char* string1, const char* string2, int size) { int r = 0; for (int j...

c++ optimization x86-64 sse simd

Halfdan asked 24/3, 2013 at 13:23
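One common SSE2 approach compares 16 bytes at a time with `_mm_cmpeq_epi8` and turns the result into a 16-bit mask with `_mm_movemask_epi8`. It then counts the set bits and handles the remainder with a scalar tail. The sketch uses a portable popcount loop; on GCC/Clang, `__builtin_popcount` would also work. The name `countEqualChars_sse2` is hypothetical, chosen so it is not confused with the question's function.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: count j in [0, size) with s1[j] == s2[j]. */
static int countEqualChars_sse2(const char *s1, const char *s2, int size)
{
    int r = 0, j = 0;
    for (; j + 16 <= size; j += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(s1 + j));
        __m128i b = _mm_loadu_si128((const __m128i *)(s2 + j));
        /* 0xFF in each matching byte, 0x00 otherwise; one mask bit per byte */
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
        while (mask) {          /* portable popcount of the 16-bit mask */
            mask &= mask - 1;
            r++;
        }
    }
    for (; j < size; j++)       /* scalar tail */
        r += (s1[j] == s2[j]);
    return r;
}
```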

3

Solved

Horizontal minimum and maximum using SSE

I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the time. I have been using t...

c++ max sse minimum avx

Garrotte asked 7/3, 2014 at 17:17
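For a single `__m128` of four floats, a horizontal min or max needs only two shuffle-and-compare steps. The first swaps neighbouring lanes and the second swaps the halves, which leaves the result in every lane. This is a sketch assuming no NaN inputs (`_mm_min_ps`/`_mm_max_ps` return the second operand when either is NaN). The helper names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE */

/* Hypothetical helper: minimum of the four lanes. */
static inline float hmin_ps(__m128 v)
{
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1))); /* swap neighbours */
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2))); /* swap halves */
    return _mm_cvtss_f32(v);
}

/* Hypothetical helper: maximum of the four lanes (same pattern). */
static inline float hmax_ps(__m128 v)
{
    v = _mm_max_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)));
    v = _mm_max_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_cvtss_f32(v);
}
```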

4

Solved

bitpack ascii string into 7-bit binary blob using SIMD

Related: bitpack ascii string into 7-bit binary blob using ARM-v8 Neon SIMD - same question specialized for AArch64 intrinsics. This question covers portable C and x86-64 intrinsics. I would like ...

c ascii simd sse intrinsics

Fronton asked 17/12, 2022 at 4:41

3

Solved

Get sum of values stored in __m256d with SSE/AVX

Is there a way to get sum of values stored in __m256d variable? I have this code. acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec)); //acc in this point contains {2.0, 8.0, 18.0, 32.0} acc = _mm25...

c++ optimization sse avx avx2

Hendecahedron asked 20/4, 2018 at 12:27
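A usual way to reduce a `__m256d` is to split it into two `__m128d` halves, add them, and then add the two remaining lanes. The sketch below loads the four doubles from memory and uses the GCC/Clang `target("avx")` attribute, so it compiles without `-mavx`. In code already built for AVX, the same body could take the `__m256d` directly. The CPU must support AVX at run time, and the helper name `hsum_pd256` is hypothetical.

```c
#include <immintrin.h>

/* Hypothetical helper: horizontal sum of four doubles via AVX + SSE2. */
static __attribute__((target("avx"))) double hsum_pd256(const double *p)
{
    __m256d v  = _mm256_loadu_pd(p);
    __m128d lo = _mm256_castpd256_pd128(v);    /* elements 0, 1 */
    __m128d hi = _mm256_extractf128_pd(v, 1);  /* elements 2, 3 */
    lo = _mm_add_pd(lo, hi);                   /* {e0+e2, e1+e3} */
    __m128d h = _mm_unpackhi_pd(lo, lo);       /* {e1+e3, e1+e3} */
    return _mm_cvtsd_f64(_mm_add_sd(lo, h));   /* (e0+e2) + (e1+e3) */
}
```

With the values from the question, `{2.0, 8.0, 18.0, 32.0}`, this returns 60.0.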

6

Solved

SSE _mm_movemask_epi8 equivalent method for ARM NEON

I decided to continue the Fast corners optimisation and got stuck at the _mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with a uint8x16_t input?

arm sse neon

Bluster asked 8/8, 2012 at 18:33
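Before porting, it helps to pin down exactly what `_mm_movemask_epi8` produces: bit i of the result is the most significant bit of byte i. The portable C reference below states that contract, and any NEON version operating on a `uint8x16_t` should match it. This is a scalar reference for testing a port, not NEON code. The name `movemask_epi8_ref` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical scalar reference for _mm_movemask_epi8 semantics:
   result bit i = top bit (bit 7) of bytes[i], for i = 0..15. */
static int movemask_epi8_ref(const uint8_t bytes[16])
{
    int mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (bytes[i] >> 7) << i;
    return mask;
}
```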

7

Solved

SSE instructions: which CPUs can do atomic 16B memory operations?

Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory l...

concurrency x86 thread-safety atomic sse

Deering asked 4/10, 2011 at 9:48

7

Solved

How to check if a CPU supports the SSE3 instruction set?

Is the following code valid to check if a CPU supports the SSE3 instruction set? Using the IsProcessorFeaturePresent() function apparently does not work on Windows XP. bool CheckSSE3() { int CPUIn...

c++ sse instruction-set avx cpuid

Yuk asked 25/5, 2011 at 8:49
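SSE3 support is reported by CPUID leaf 1 in ECX bit 0. The sketch below uses the GCC/Clang `<cpuid.h>` helper `__get_cpuid` and its `bit_SSE3` constant. MSVC code like the question's would call `__cpuid` from `<intrin.h>` and test the same bit. The function name `has_sse3` is hypothetical.

```c
#include <cpuid.h>  /* GCC/Clang; MSVC provides __cpuid in <intrin.h> instead */

/* Hypothetical helper: 1 if CPUID leaf 1 reports SSE3 (ECX bit 0), else 0. */
static int has_sse3(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* leaf 1 not available */
    return (ecx & bit_SSE3) != 0;
}
```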

3

Solved

SSE: convert __m128 to float

I have the following piece of C code: __m128 pSrc1 = _mm_set1_ps(4.0f); __m128 pDest; int i; for (i=0;i<100;i++) { m1 = _mm_mul_ps(pSrc1, pSrc1); m2 = _mm_mul_ps(pSrc1, pSrc1); m3 = _mm_ad...

c++ c sse

Astera asked 16/1, 2013 at 20:49
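Two standard ways to get floats out of a `__m128` are shown below. `_mm_cvtss_f32` returns the lowest lane as a `float`, and `_mm_storeu_ps` writes all four lanes to an array. This is a small sketch, and the wrapper names `first_ps` and `to_array_ps` are hypothetical.

```c
#include <xmmintrin.h>  /* SSE */

/* Hypothetical wrapper: lowest lane of v as a float. */
static inline float first_ps(__m128 v) { return _mm_cvtss_f32(v); }

/* Hypothetical wrapper: all four lanes of v, lowest lane first. */
static inline void to_array_ps(__m128 v, float out[4]) { _mm_storeu_ps(out, v); }
```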

3

Trying to write a vectorized implementation of Gerd Isenberg's Bit Scan Forward as an exercise

I'm trying to write a vectorized implementation of BSF as an exercise, but I'm stuck: it doesn't work. The algorithm: short bitScanForward(int16_t bb) { constexpr uint16_t two = static_cast<u...

c++ bit-manipulation vectorization simd sse

Alienist asked 3/10, 2022 at 3:31

1

Solved

Is there a way to force visual studio to generate aligned instructions from SSE intrinsics?

The _mm_load_ps() SSE intrinsic is defined as an aligned load, throwing an exception if the address is not aligned. However, Visual Studio seems to generate an unaligned read instead. Since not all compilers a...

visual-studio visual-c++ sse intrinsics memory-alignment

Sipper asked 15/5, 2020 at 9:32

8

Solved

SIMD programming languages [closed]

In the last couple of years, I've been doing a lot of SIMD programming and most of the time I've been relying on compiler intrinsic functions (such as the ones for SSE programming) or on prog...

programming-languages sse simd ispc

Titmouse asked 13/9, 2009 at 12:50
