Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?
Asked Answered
R

2

6

I want to calculate y = ax + b, where x and y is a pixel value [i.e, byte with value range is 0~255], while a and b is a float

Since I need to apply this formula for each pixel in image, in addition, a and b is different for different pixel. Direct calculation in C++ is slow, so I am kind of interest to know the sse2 instruction in c++..

After searching, I find that the multiplication and addition in float with sse2 is just as _mm_mul_ps and _mm_add_ps. But in the first place I need to convert the x in byte to float (4 byte).

The question is, after I load the data from byte-data source (_mm_load_si128), how can I convert the data from byte to float?

Rid answered 29/8, 2015 at 8:26 Comment(2)
Possible duplicate: C++ SSE filter implementationNatascha
Have you tried enabling SSE2 in Project->Properties->Code Generation? The compiler might be willing to do this for you.Hannon
P
6

a and b are different for each pixel? That's going to make it difficult to vectorize, unless there's a pattern or you can generate them in vectors.

Is there any way you can efficiently generate a and b in vectors, either as fixed-point or floating point? If not, inserting 4 FP values, or 8 16bit integers, might be worse than just scalar ops.


Fixed point

If a and b can be reused at all, or generated with fixed-point in the first place, this might be a good use-case for fixed-point math. (i.e. integers that represent value * 2^scale). SSE/AVX don't have a 8b*8b->16b multiply; the smallest elements are words, so you have to unpack bytes to words, but not all the way to 32bit. This means you can process twice as much data per instruction.

There's a _mm_maddubs_epi16 instruction which might be useful if b and a change infrequently enough, or you can easily generate a vector with alternating a2^4 and b2^1 bytes. Apparently it's really handy for bilinear interpolation, but it still gets the job done for us with minimal shuffling, if we can prepare an a and b vector.

float a, b;
const int logascale = 4, logbscale=1;
const int ascale = 1<<logascale;  // fixed point scale for a: 2^4
const int bscale = 1<<logbscale;  // fixed point scale for b: 2^1

const __m128i brescale = _mm_set1_epi8(1<<(logascale-logbscale));  // re-scale b to match a in the 16bit temporary result

for (i=0 ; i<n; i+=16) {
    //__m128i avec = get_scaled_a(i);  
    //__m128i bvec = get_scaled_b(i);
    //__m128i ab_lo = _mm_unpacklo_epi8(avec, bvec);
    //__m128i ab_hi = _mm_unpackhi_epi8(avec, bvec);

    __m128i abvec = _mm_set1_epi16( ((int8_t)(bscale*b) << 8) | (int8_t)(ascale*a) );  // integer promotion rules might do sign-extension in the wrong place here, so check this if you actually write it this way.

    __m128i block = _mm_load_si128(&buf[i]);  // call this { v[0] .. v[15] }

    __m128i lo = _mm_unpacklo_epi8(block, brescale);  // {v[0], 8, v[1], 8, ...}
    __m128i hi = _mm_unpackhi_epi8(block, brescale);  // {v[8], 8, v[9], 8, ...
    lo = _mm_maddubs_epi16(lo, abvec);  // first arg is unsigned bytes, 2nd arg is signed bytes
    hi = _mm_maddubs_epi16(hi, abvec);
    // lo = { v[0]*(2^4*a) + 8*(2^1*b), ... }

    lo = _mm_srli_epi16(lo, logascale);  // truncate from scaled fixed-point to integer
    hi = _mm_srli_epi16(hi, logascale);

    // and re-pack.  Logical, not arithmetic right shift means sign bits can't be set
    block = _mm_packuswb(lo, hi);  
    _mm_store_si128(&buf[i], block);
}
// then a scalar cleanup loop

2^4 is an arbitrary choice. It leaves 3 non-sign bits for the integer part of a, and 4 fraction bits. So it effectively rounds a to the nearest 16th, and overflows if it has a magnitude greater than 8 and 15/16ths. 2^6 would give more fractional bits, and allow a from -2 to +1 and 63/64ths.

Since b is being added, not multiplied, its useful range is much larger, and fractional part much less useful. To represent it in 8 bits, rounding it to the nearest half still keeps a little bit of fractional information, but allows it to be [-64 : 63.5] without overflowing.

For more precision, 16b fixed-point is a good choice. You can scale a and b up by 2^7 or something, to have 7b of fractional precision and still allow the integer part to be [-256 .. 255]. There's no multiply-and-add instruction for this case, so you'd have to do that separately. Good options for doing the multiply include:

  • _mm_mulhi_epu16: unsigned 16b*16b->high16 (bits [31:16]). Useful if a can't be negative
  • _mm_mulhi_epi16: signed 16b*16b->high16 (bits [31:16]).
  • _mm_mulhrs_epi16: signed 16b*16b->bits [30:15] of the 32b temporary, with rounding. With a good choice of scaling factor for a, this should be nicer. As I understand it, SSSE3 introduced this instruction for exactly this kind of use.
  • _mm_mullo_epi16: signed 16b*16b->low16 (bits [15:0]). This only allows 8 significant bits for a before the low16 result overflows, so I think all you gain over the _mm_maddubs_epi16 8bit solution is more precision for b.

To use these, you'd get scaled 16b vectors of a and b values, then:

  • unpack your bytes with zero (or pmovzx byte->word), to get signed words still in the [0..255] range
  • left shift the words by 7.
  • multiply by your a vector of 16b words, taking the upper half of each 16*16->32 result. (e.g. mul
  • right shift here if you wanted different scales for a and b, to get more fractional precision for a
  • add b to that.
  • right shift to do the final truncation back from fixed point to [0..255].

With a good choice of fixed-point scale, this should be able to handle a wider range of a and b, as well as more fractional precision, than 8bit fixed point.

If you don't left-shift your bytes after unpacking them to words, a has to be full-range just to get 8bits set in the high16 of the result. This would mean a very limited range of a that you could support without truncating your temporary to less than 8 bits during the multiply. Even _mm_mulhrs_epi16 doesn't leave much room, since it starts at bit 30.


expand bytes to floats

If you can't efficiently generate fixed-point a and b values for every pixel, it may be best to convert your pixels to floats. This takes more unpacking/repacking, so latency and throughput are worse. It's worth looking into generating a and b with fixed point.

For packed-float to work, you still have to efficiently build a vector of a values for 4 adjacent pixels.

This is a good use-case for pmovzx (SSE4.1), because it can go directly from 8b elements to 32b. The other options are SSE2 punpck[l/h]bw/punpck[l/h]wd with multiple steps, or SSSE3 pshufb to emulate pmovzx. (You can do one 16B load and shuffle it 4 different ways to unpack it to four vectors of 32b ints.)

char *buf;

// const __m128i zero = _mm_setzero_si128();
for (i=0 ; i<n; i+=16) {
    __m128 a = get_a(i);
    __m128 b = get_b(i);

    // IDK why there isn't an intrinsic for using `pmovzx` as a load, because it takes a m32 or m64 operand, not m128.  (unlike punpck*)
    __m128i unsigned_dwords = _mm_cvtepu8_epi32( _mm_loadu_si32(buf+i));  // load 4B at once. 
   //  Current GCC has a bug with _mm_loadu_si32, might want to use _mm_load_ss and _mm_castps_si128 until it's fixed.
    __m128 floats = _mm_cvtepi32_ps(unsigned_dwords);
  
    floats = _mm_fmadd_ps(floats, a, b);  // with FMA available, this might as well be 256b vectors, even with the inconvenience of the different lane-crossing semantics of pmovzx vs. punpck
    // or without FMA, do this with _mm_mul_ps and _mm_add_ps
    unsigned_dwords = _mm_cvtps_epi32(floats);

    // repeat 3 more times for buf+4, buf+8, and buf+12, then:

    __m128i packed01 = _mm_packss_epi32(dwords0, dwords1);  // SSE2
    __m128i packed23 = _mm_packss_epi32(dwords2, dwords3);
    // packuswb wants SIGNED input, so do signed saturation on the first step

    // saturate into [0..255] range
    __m12i8 packedbytes=_mm_packus_epi16(packed01, packed23);  // SSE2
    _mm_store_si128(buf+i, packedbytes);  // or storeu if buf isn't aligned.
}

// cleanup code to handle the odd up-to-15 leftover bytes, if n%16 != 0

(Re: a load that can be a memory source operand for pmovzxbd, see also Loading 8 chars from memory into an __m256 variable as packed single precision floats re: the problems compilers have with this.) And see also GCC bug 99754 - wrong code for _mm_loadu_si32 - reversed vector elements.

The previous version of this answer went from float->uint8 vectors with packusdw/packuswb, and had a whole section on workarounds for without SSE4.1. None of that masking-the-sign-bit after an unsigned pack is needed if you simply stay in the signed integer domain until the last pack. I assume this is the reason SSE2 only included signed pack from dword to word, but both signed and unsigned pack from word to byte. packuswd is only useful if your final goal is uint16_t, rather than further packing.


The last CPU to not have SSE4.1 was Intel Conroe/merom (first gen Core2, from before late 2007), and AMD pre Barcelona (before late 2007). If working-but-slow is acceptable for those CPUs, just write a version for AVX2, and a version for SSE4.1. Or SSSE3 (with 4x pshufb to emulate pmovzxbd of the four 32b elements of a register) pshufb is slow on Conroe, though, so if you care about CPUs without SSE4.1, write a specific version. Actually, Conroe/merom also has slow xmm punpcklbw and so on (except for q->dq). 4x slow pshufb should still beats 6x slow unpacks. Vectorizing is a lot less of a win on pre-Wolfdale, because of the slow shuffles for unpacking and repacking. The fixed point version, with a lot less unpacking/repacking, will have an even bigger advantage there.

See the edit history for an unfinished attempt at using punpck before I realized how many extra instructions it was going to need. Removed it because this answer is long already, and another code block would be confusing.

Payola answered 29/8, 2015 at 17:40 Comment(4)
The solution to convert the byte to float works, but it seems that the solution for converting of the float to unsigned char does not work for value larger than 128Rid
@Edwin: I see the problem now. I misread the docs for pack*: they treat their input as signed. It's just the output clamp range that's different between packus and packss. stackoverflow.com/questions/29856006/… had a comment about this, which I didn't grok until now. If you convert your bytes to the [-128..127] range (with _mm_add_epi8(block, _mm_set1_epi8(-128))) before got to float and back, you can use signed saturation, then convert back with a single mm_sub. Editting my answer.Payola
Update: Either I'm dumb for not thinking of packssdw/packuswb when I wrote this, or there's something wrong with https://mcmap.net/q/17758/-sse-intrinsics-convert-32-bit-floats-to-unsigned-8-bit-integers that I'm not thinking of ATM.Payola
@Edwin: if you're still using this, I recently noticed a huge simplification in the float->uint8_t packing. There's no need for SSE4.1, only SSE2 (for the pack stage. PMOVZX is still nice to have for the unpack stage). But hopefully you're using fixed point anyway.Payola
U
0

I guess you're looking fro the __m128 _mm_cvtpi8_ps(__m64 a ) composite intrinsic.

Here is a minimal example:

#include <xmmintrin.h>
#include <stdio.h>

int main() {
  unsigned char  a[4] __attribute__((aligned(32)))= {1,2,3,4};
  float b[4] __attribute__((aligned(32)));
  _mm_store_ps(b, _mm_cvtpi8_ps(*(__m64*)a));
  printf("%f %f, %f, %f\n", b[0], b[1], b[2], b[3]);
  return 0;
}
Ugly answered 29/8, 2015 at 12:40 Comment(2)
The _mm_cvtpi8_ps intrinsic isn't a single instruction. It's for signed bytes, so it will give wrong answers for values > 127. Also, gcc's implementation uses punpcklbw between the input and pcmpgtb 0, input to sign-extend instead of zero-extend, then does that again for word->dword. gcc -O3 on your answer unfortunately generates nasty asm that uses MMX registers, and doesn't use pmovsx even when it's available (with -march=sandybridge)Payola
_mm_cvtpu8_ps does what the OP wants, but since it uses MMX regs instead of SSE, it still has to do more work. It's going to be slower than anything you'd write by hand with punpck with __m128i types.Payola

© 2022 - 2024 — McMap. All rights reserved.