How to optimize a cycle?
Asked Answered
C

6

6

I have the following bottleneck function.

typedef unsigned char byte;
void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
     const byte b1 = 128-30;
     const byte b2 = 128+30;
     for (const byte * p1 = p1Start; p1 != p1End; ++p1, ++p2, ++p3) {
        *p3 = (*p1 < *p2 ) ? b1 : b2;
    }
}

I want to replace C++ code with SSE2 intinsic functions. I have tried _mm_cmpgt_epi8 but it used signed compare. I need unsigned compare.

Is there any trick (SSE, SSE2, SSSE3) to solve my problem?

Note: I do not want to use multi-threading in this case.

Chaldean answered 21/10, 2010 at 11:40 Comment(3)
Do you know which processor architecture you are targeting? Working with one 64 bit word chunk at a time (bit twiddling to make the comparisons in-register) could reduce memory bus contention somewhat. The compiler's assembly code should help provide ideas... ...and isn't SSE intended for floating point, not integer operations?Johannessen
SSE has some integer instructions.Huddle
Why not made them signed? a simple XOR 0x80 with each element before comparation will do the job.Cali
B
9

Instead of offsetting your signed values to make them unsigned, a slightly more efficient way would be to do the following:

  • use _mm_min_epu8 to get the unsigned min of p1, p2
  • compare this min for equality with p2 using _mm_cmpeq_epi8
  • the resulting mask will now be 0x00 for elements where p1 < p2 and 0xff for elements where p1 >= p2
  • you can now use this mask with _mm_or_si128 and _mm_andc_si128 to select the appropriate b1/b2 values

Note that this is 4 instructions in total, compared with 5 using the offset + signed comparison approach.

Bordure answered 21/10, 2010 at 12:58 Comment(0)
G
2

You can subtract 127 from your numbers, and then use _mm_cmpgt_epi8

Guideboard answered 21/10, 2010 at 12:13 Comment(2)
It seems like right answer. But I think that your 127 must be replaced with 128. Or xor with 128.Chaldean
The problem is I think there's only a packed add in MMX, which is a different register set entirely.Huddle
H
2

Yes, this can be done in SIMD, but it will take a few steps to make the mask.

Ruslik got it right, I think. You want to xor each component with 0x80 to flip the sense of the signed and unsigned comparison. _mm_xor_si128 (PXOR) gets you that -- you'll need to create the mask as a static char array somewhere before loading it into a SIMD register. Then _mm_cmpgt_epi8 gets you a mask and you can use a bitwise AND (eg _mm_and_si128) to perform a masked-move.

Huddle answered 21/10, 2010 at 12:19 Comment(0)
D
1

Yes, SSE will not work here. You can improve this code performance on multi-core computer by using OpenMP:

void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
     const byte b1 = 128-30;
     const byte b2 = 128+30;

     int n = p1End - p1Start;
     #pragma omp parallel for
     for (int i = 0; i < n; ++p1, ++i) 
     {
        p3[i] = (p1[i] < p2[i]) ? b1 : b2;
     }
}
Disc answered 21/10, 2010 at 12:1 Comment(1)
@ VJo - yes, of course. On single core computer this code performs exactly like original code from the question.Disc
B
-1

Unfortunately, many of the answers above are incorrect. Let's assume a 3-bit word:

unsigned: 4 5 6 7 0 1 2 3 == signed: -4 -3 -2 -1 0 1 2 3 (bits: 100 101 110 111 000 001 010 011)

The method by Paul R is incorrect. Suppose we want to know if 3 > 2. min(3,2) == 2, which suggests yes, so the method works here. Now suppose we want to know if if 7 > 2. The value 7 is -1 in signed representation, so min(-1,2) == -1, which suggests wrongly that 7 is not greater than 2 unsigned.

The method by Andrey is also incorrect. Suppose we want to know if 7 > 2, or a = 7, and b = 2. The value 7 is -1 in signed representation, so the first term (a > b) fails, and the method suggests that 7 is not greater than 2.

However, the method by BJobnh, as corrected by Alexey, is correct. Just subtract 2^(n-1) from the values, where n is the number of bits. In this case, we would subtract 4 to obtain new corresponding values:

old signed: -4 -3 -2 -1 0 1 2 3 => new signed: 0 1 2 3 -4 -3 -2 -1 == new unsigned 0 1 2 3 4 5 6 7.

In other words, unsigned_greater_than(a,b) is equivalent to signed_greater_than(a - 2^(n-1), b - 2^(n-1)).

Billy answered 4/7, 2013 at 0:37 Comment(1)
If you look closely at my answer you'll see I'm using an unsigned min operation.Bordure
I
-3

use pcmpeqb and be the Power with you.

Impressionism answered 21/10, 2010 at 12:6 Comment(3)
pcmpeqb is a check for equal. I need less compare.Chaldean
ah yes. then pcmpgtb. still use the Power. but wisely.Impressionism
OP needs unsigned comparison.Bullyrag

© 2022 - 2024 — McMap. All rights reserved.