How to optimize a cycle?

C

6

I have the following bottleneck function.

typedef unsigned char byte;
void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
     const byte b1 = 128-30;
     const byte b2 = 128+30;
     for (const byte * p1 = p1Start; p1 != p1End; ++p1, ++p2, ++p3) {
        *p3 = (*p1 < *p2 ) ? b1 : b2;
    }
}

I want to replace C++ code with SSE2 intinsic functions. I have tried _mm_cmpgt_epi8 but it used signed compare. I need unsigned compare.

Is there any trick (SSE, SSE2, SSSE3) to solve my problem?

Note: I do not want to use multi-threading in this case.

Chaldean answered 21/10, 2010 at 11:40 Comment(3)

Do you know which processor architecture you are targeting? Working with one 64 bit word chunk at a time (bit twiddling to make the comparisons in-register) could reduce memory bus contention somewhat. The compiler's assembly code should help provide ideas... ...and isn't SSE intended for floating point, not integer operations? – Johannessen 21/10, 2010 at 11:58

SSE has some integer instructions. – Huddle 21/10, 2010 at 11:59

Why not made them signed? a simple XOR 0x80 with each element before comparation will do the job. – Cali 21/10, 2010 at 12:1

B

9

Instead of offsetting your signed values to make them unsigned, a slightly more efficient way would be to do the following:

use _mm_min_epu8 to get the unsigned min of p1, p2
compare this min for equality with p2 using _mm_cmpeq_epi8
the resulting mask will now be 0x00 for elements where p1 < p2 and 0xff for elements where p1 >= p2
you can now use this mask with _mm_or_si128 and _mm_andc_si128 to select the appropriate b1/b2 values

Note that this is 4 instructions in total, compared with 5 using the offset + signed comparison approach.

Bordure answered 21/10, 2010 at 12:58 Comment(0)

G

2

You can subtract 127 from your numbers, and then use _mm_cmpgt_epi8

Guideboard answered 21/10, 2010 at 12:13 Comment(2)

It seems like right answer. But I think that your 127 must be replaced with 128. Or xor with 128. – Chaldean 21/10, 2010 at 12:22

The problem is I think there's only a packed add in MMX, which is a different register set entirely. – Huddle 21/10, 2010 at 12:23

H

2

Yes, this can be done in SIMD, but it will take a few steps to make the mask.

Ruslik got it right, I think. You want to xor each component with 0x80 to flip the sense of the signed and unsigned comparison. _mm_xor_si128 (PXOR) gets you that -- you'll need to create the mask as a static char array somewhere before loading it into a SIMD register. Then _mm_cmpgt_epi8 gets you a mask and you can use a bitwise AND (eg _mm_and_si128) to perform a masked-move.

Huddle answered 21/10, 2010 at 12:19 Comment(0)

D

1

Yes, SSE will not work here. You can improve this code performance on multi-core computer by using OpenMP:

void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
     const byte b1 = 128-30;
     const byte b2 = 128+30;

     int n = p1End - p1Start;
     #pragma omp parallel for
     for (int i = 0; i < n; ++p1, ++i) 
     {
        p3[i] = (p1[i] < p2[i]) ? b1 : b2;
     }
}

Disc answered 21/10, 2010 at 12:1 Comment(1)

@ VJo - yes, of course. On single core computer this code performs exactly like original code from the question. – Disc 21/10, 2010 at 12:9

B

-1

Unfortunately, many of the answers above are incorrect. Let's assume a 3-bit word:

unsigned: 4 5 6 7 0 1 2 3 == signed: -4 -3 -2 -1 0 1 2 3 (bits: 100 101 110 111 000 001 010 011)

The method by Paul R is incorrect. Suppose we want to know if 3 > 2. min(3,2) == 2, which suggests yes, so the method works here. Now suppose we want to know if if 7 > 2. The value 7 is -1 in signed representation, so min(-1,2) == -1, which suggests wrongly that 7 is not greater than 2 unsigned.

The method by Andrey is also incorrect. Suppose we want to know if 7 > 2, or a = 7, and b = 2. The value 7 is -1 in signed representation, so the first term (a > b) fails, and the method suggests that 7 is not greater than 2.

However, the method by BJobnh, as corrected by Alexey, is correct. Just subtract 2^(n-1) from the values, where n is the number of bits. In this case, we would subtract 4 to obtain new corresponding values:

old signed: -4 -3 -2 -1 0 1 2 3 => new signed: 0 1 2 3 -4 -3 -2 -1 == new unsigned 0 1 2 3 4 5 6 7.

In other words, unsigned_greater_than(a,b) is equivalent to signed_greater_than(a - 2^(n-1), b - 2^(n-1)).

Billy answered 4/7, 2013 at 0:37 Comment(1)

If you look closely at my answer you'll see I'm using an unsigned min operation. – Bordure 20/11, 2015 at 11:14

I

-3

use pcmpeqb and be the Power with you.

Impressionism answered 21/10, 2010 at 12:6 Comment(3)

pcmpeqb is a check for equal. I need less compare. – Chaldean 21/10, 2010 at 12:11

ah yes. then pcmpgtb. still use the Power. but wisely. – Impressionism 21/10, 2010 at 12:16

OP needs unsigned comparison. – Bullyrag 21/10, 2010 at 12:47

Recommended topics

Hot tags