sse Questions
1
Solved
I recently read the disassembly of my compiled code and found that many floating-point operations are done using XMM registers and SSE instructions. For example, the following code:
fl...
Deposit asked 11/9, 2024 at 11:52
3
Solved
Is there a way to push a packed doubleword integer from an XMM register onto the stack, and then pop it back later when needed?
Ideally I am looking for something like PUSH or POP for general pur...
0
Please note that this question is not about YUV422 to RGB conversion!
I have this code for a pixel order YUV422 to RGB conversion.
static void yuv422ToRGB(unsigned char* img,
int width, int height...
Cocotte asked 20/6, 2024 at 15:3
2
Solved
How can I divide 16 8-bit integers by 4 (or shift them 2 to the right) using SSE intrinsics?
Betimes asked 9/1, 2017 at 19:32
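A minimal sketch of one common approach to the question above, assuming the 16 values are unsigned bytes: SSE2 has no 8-bit shift, so the usual workaround is to shift 16-bit lanes with _mm_srli_epi16 and then mask off the two bits that leak in from the neighbouring byte. The names srli_epu8_by2 and div4_u8x16 are hypothetical helpers introduced here for illustration, not part of any library.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Shift each of the 16 unsigned bytes in v right by 2 (i.e. divide by 4).
   Assumption: the bytes are unsigned. The 16-bit shift lets the low two
   bits of each high byte spill into the top of the low byte, so every
   byte is masked with 0x3F afterwards to clear them. */
static __m128i srli_epu8_by2(__m128i v)
{
    __m128i shifted = _mm_srli_epi16(v, 2);
    return _mm_and_si128(shifted, _mm_set1_epi8(0x3F));
}

/* Hypothetical convenience wrapper: divide 16 bytes from in[] by 4
   and store the results in out[]. Unaligned loads/stores are used so
   the arrays need no special alignment. */
static void div4_u8x16(const uint8_t *in, uint8_t *out)
{
    assert(in != NULL && out != NULL);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, srli_epu8_by2(v));
}
```

For signed bytes this sketch would not be correct as written, since a logical shift does not preserve the sign; that case needs a different sequence.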
4
Solved
I am looking to optimise some SSE code I wrote for converting YUV to RGB (both planar and packed YUV functions).
I am using SSSE3 at the moment, but if there are useful functions from later SSE ve...
Arman asked 31/12, 2010 at 22:20
1
Solved
std::replace implementation can be optimized using vectorization (by specializing the library implementation or by the compiler).
The vectorized implementation would compare and replace several ele...
Danille asked 2/3, 2024 at 10:39
4
Solved
This is a somewhat low-level question. In x86 assembly there are two SSE instructions:
MOVDQA xmmi, m128
and
MOVNTDQA xmmi, m128
The IA-32 Software Developer's Manual says that the NT i...
1
Solved
I have this function in C++
void routine2(float alpha, float beta) {
    unsigned int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            w[i] = w[i] - beta + alpha * A[i][j] * x[j];
}
...
Hebetate asked 19/12, 2023 at 21:4
1
I would like to implement the following function using SSE. It blends elements from a with packed elements from b, where elements are only present if they are used.
void packedBlend16(uint8_t mask...
4
Solved
SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers.
_mm_cvtps_epi32()
_mm_cvtepi32_ps()
But there are no equivalents for double-precision and 64-bi...
Apotheosize asked 14/12, 2016 at 14:9
2
Solved
Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, it does this:
i...
4
Solved
I was trying to run the following,
type
Vector = array [1..4] of Single;
{$CODEALIGN 16}
function add4(const a, b: Vector): Vector; register; assembler;
asm
movaps xmm0, [a]
movaps xmm1, [b]
...
2
Solved
I want to perform an arbitrary permutation of single bits, pairs of bits, and nibbles (4 bits) on a CPU register (xmm, ymm or zmm) of width 128, 256 or 512 bits; this should be as fast as possible....
4
Solved
Profiling suggests that this function is a real bottleneck for my application:
static inline int countEqualChars(const char* string1, const char* string2, int size) {
int r = 0;
for (int j...
Halfdan asked 24/3, 2013 at 13:23
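The loop body in the excerpt above is cut off, so the following is only a sketch under an assumption: that the function counts the positions j < size at which string1[j] equals string2[j]. Under that assumption, one SSE2 approach is to compare 16 bytes at a time with _mm_cmpeq_epi8, turn the result into a bit mask with _mm_movemask_epi8, and count the set bits. The name count_equal_chars_sse2 is hypothetical, and __builtin_popcount is a GCC/Clang builtin (other compilers need their own popcount).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Assumed semantics: number of indices j in [0, size) where s1[j] == s2[j]. */
static int count_equal_chars_sse2(const char *s1, const char *s2, int size)
{
    assert(size >= 0);
    int r = 0;
    int j = 0;

    /* Main loop: compare 16 bytes per iteration. */
    for (; j + 16 <= size; j += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(s1 + j));
        __m128i b = _mm_loadu_si128((const __m128i *)(s2 + j));
        /* Equal bytes become 0xFF; movemask packs their top bits into 16 bits. */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
        r += __builtin_popcount((unsigned)mask);
    }

    /* Scalar tail for the remaining (size % 16) bytes. */
    for (; j < size; j++)
        r += (s1[j] == s2[j]);

    return r;
}
```

Whether this actually beats the scalar loop depends on the typical size and on what the compiler already auto-vectorizes, so it is worth profiling again before adopting it.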
4
Solved
Related: bitpack ascii string into 7-bit binary blob using ARM-v8 Neon SIMD - same question specialized for AArch64 intrinsics. This question covers portable C and x86-64 intrinsics.
I would like ...
Fronton asked 17/12, 2022 at 4:41
3
Solved
Is there a way to get the sum of the values stored in a __m256d variable? I have this code.
acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec));
// acc at this point contains {2.0, 8.0, 18.0, 32.0}
acc = _mm25...
Hendecahedron asked 20/4, 2018 at 12:27
6
Solved
I decided to continue my FAST corners optimisation and got stuck at the
_mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with a uint8x16_t input?
7
Solved
Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory l...
Deering asked 4/10, 2011 at 9:48
7
Solved
Is the following code valid to check if a CPU supports the SSE3 instruction set?
Using the IsProcessorFeaturePresent() function apparently does not work on Windows XP.
bool CheckSSE3()
{
int CPUIn...
Yuk asked 25/5, 2011 at 8:49
3
Solved
I have the following piece of C code:
__m128 pSrc1 = _mm_set1_ps(4.0f);
__m128 pDest;
int i;
for (i=0;i<100;i++) {
    m1 = _mm_mul_ps(pSrc1, pSrc1);
    m2 = _mm_mul_ps(pSrc1, pSrc1);
    m3 = _mm_ad...
3
I'm trying to write a vectorized implementation of BSF as an exercise, but I'm stuck: it doesn't work.
The algorithm:
short bitScanForward(int16_t bb)
{
constexpr uint16_t two = static_cast<u...
Alienist asked 3/10, 2022 at 3:31
1
Solved
The _mm_load_ps() SSE intrinsic is defined as aligned, throwing an exception if the address is not aligned. However, it seems Visual Studio generates an unaligned read instead.
Since not all compilers a...
Sipper asked 15/5, 2020 at 9:32
8
Solved
In the last couple of years, I've been doing a lot of SIMD programming and most of the time I've been relying on compiler intrinsic functions (such as the ones for SSE programming) or on prog...
Titmouse asked 13/9, 2009 at 12:50