sse Questions

1

Solved

I recently read the disassembly of my compiled code and found that many floating-point operations are done using XMM registers and SSE instructions. For example, the following code: fl...
Deposit asked 11/9, 2024 at 11:52

3

Solved

Is there a way to push a packed doubleword integer from an XMM register onto the stack and then pop it back later when needed? Ideally I am looking for something like PUSH or POP for general pur...
Footcandle asked 15/4, 2012 at 12:13

0

Please note that this question is not about YUV422 to RGB conversion! I have this code for a pixel order YUV422 to RGB conversion. static void yuv422ToRGB(unsigned char* img, int width, int height...
Cocotte asked 20/6, 2024 at 15:3

2

Solved

How can I divide 16 8-bit integers by 4 (or shift them 2 to the right) using SSE intrinsics?
Betimes asked 9/1, 2017 at 19:32
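
A minimal sketch of one common approach to the question above, assuming the 16 values are unsigned 8-bit integers. SSE2 has no per-byte shift, so the sketch shifts 16-bit lanes with `_mm_srli_epi16` and then masks off the two bits that leak in from the neighbouring byte. The helper names `srli_epu8_by2` and `div4_u8x16` are hypothetical and chosen only for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Shift each of the 16 unsigned bytes right by 2 (i.e. divide by 4).
   Hypothetical helper name; not a real intrinsic. */
static __m128i srli_epu8_by2(__m128i v)
{
    /* Shift 16-bit lanes: the low 2 bits of each high byte spill
       into the top 2 bits of the low byte next to it. */
    __m128i shifted = _mm_srli_epi16(v, 2);
    /* Clear the top 2 bits of every byte to discard the spilled bits. */
    return _mm_and_si128(shifted, _mm_set1_epi8(0x3F));
}

/* Convenience wrapper working on plain arrays (unaligned access). */
static void div4_u8x16(const uint8_t in[16], uint8_t out[16])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, srli_epu8_by2(v));
}
```

For signed bytes the mask trick does not preserve the sign, so a different sequence would be needed.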

4

Solved

I am looking to optimise some SSE code I wrote for converting YUV to RGB (both planar and packed YUV functions). I am using SSSE3 at the moment, but if there are useful functions from later SSE ve...
Arman asked 31/12, 2010 at 22:20

1

Solved

std::replace implementation can be optimized using vectorization (by specializing the library implementation or by the compiler). The vectorized implementation would compare and replace several ele...
Danille asked 2/3, 2024 at 10:39

4

Solved

This is a somewhat low-level question. In x86 assembly there are two SSE instructions: MOVDQA xmmi, m128 and MOVNTDQA xmmi, m128. The IA-32 Software Developer's Manual says that the NT i...
Millisent asked 31/8, 2008 at 20:18

6

Solved

I know that x87 has higher internal precision, which is probably the biggest difference that people see between it and SSE operations. But I have to wonder, is there any other benefit to using x87?...
Multipara asked 4/12, 2009 at 3:33

1

Solved

I have this function in C++ void routine2(float alpha, float beta) { unsigned int i, j; for (i = 0; i < N; i++) for (j = 0; j < N; j++) w[i] = w[i] - beta + alpha * A[i][j] * x[j]; } ...
Hebetate asked 19/12, 2023 at 21:4

1

I would like to implement the following function using SSE. It blends elements from a with packed elements from b, where elements are only present if they are used. void packedBlend16(uint8_t mask...
Cherie asked 16/5, 2020 at 19:52

4

Solved

SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers. _mm_cvtps_epi32() _mm_cvtepi32_ps() But there are no equivalents for double-precision and 64-bi...
Apotheosize asked 14/12, 2016 at 14:9

2

Solved

Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, it does this: i...
Womenfolk asked 28/7, 2017 at 14:36

4

Solved

I was trying to run the following, type Vector = array [1..4] of Single; {$CODEALIGN 16} function add4(const a, b: Vector): Vector; register; assembler; asm movaps xmm0, [a] movaps xmm1, [b] ...
Coulter asked 4/4, 2013 at 1:57

2

Solved

I want to perform an arbitrary permutation of single bits, pairs of bits, and nibbles (4 bits) on a CPU register (xmm, ymm or zmm) of width 128, 256 or 512 bits; this should be as fast as possible....
Grasmere asked 28/1, 2019 at 19:9

4

Solved

Profiling suggests that this function is a real bottleneck for my application: static inline int countEqualChars(const char* string1, const char* string2, int size) { int r = 0; for (int j...
Halfdan asked 24/3, 2013 at 13:23

3

Solved

I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the time. I have been using t...
Garrotte asked 7/3, 2014 at 17:17

4

Solved

Related: bitpack ascii string into 7-bit binary blob using ARM-v8 Neon SIMD - same question specialized for AArch64 intrinsics. This question covers portable C and x86-64 intrinsics. I would like ...
Fronton asked 17/12, 2022 at 4:41

3

Solved

Is there a way to get the sum of the values stored in a __m256d variable? I have this code. acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec)); //acc at this point contains {2.0, 8.0, 18.0, 32.0} acc = _mm25...
Hendecahedron asked 20/4, 2018 at 12:27

6

Solved

I decided to continue optimising FAST corner detection and got stuck on the _mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with a uint8x16_t input?
Bluster asked 8/8, 2012 at 18:33

7

Solved

Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory l...
Deering asked 4/10, 2011 at 9:48

7

Solved

Is the following code valid to check if a CPU supports the SSE3 instruction set? Using the IsProcessorFeaturePresent() function apparently does not work on Windows XP. bool CheckSSE3() { int CPUIn...
Yuk asked 25/5, 2011 at 8:49

3

Solved

I have the following piece of C code: __m128 pSrc1 = _mm_set1_ps(4.0f); __m128 pDest; int i; for (i=0;i<100;i++) { m1 = _mm_mul_ps(pSrc1, pSrc1); m2 = _mm_mul_ps(pSrc1, pSrc1); m3 = _mm_ad...
Astera asked 16/1, 2013 at 20:49

3

I'm trying to write a vectorized implementation of BSF as an exercise, but I'm stuck: it doesn't work. The algorithm: short bitScanForward(int16_t bb) { constexpr uint16_t two = static_cast<u...
Alienist asked 3/10, 2022 at 3:31

1

Solved

The _mm_load_ps() SSE intrinsic is defined as aligned, throwing an exception if the address is not aligned. However, it seems Visual Studio generates an unaligned read instead. Since not all compilers a...

8

Solved

In the last couple of years, I've been doing a lot of SIMD programming and most of the time I've been relying on compiler intrinsic functions (such as the ones for SSE programming) or on prog...
Titmouse asked 13/9, 2009 at 12:50
