Emulating variable bit-shift using only constant shifts?

8

12

I'm trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.

The particular PowerPC processor I'm working on has the quirk that a shift-by-constant-immediate, like

int ShiftByConstant( int x ) { return x << 3 ; } 

is fast, single-op, and superscalar, whereas a shift-by-variable, like

int ShiftByVar( int x, int y ) { return x << y ; }

is a microcoded operation that takes 7-11 cycles to execute while the entire rest of the pipeline stops dead.

What I'd like to do is figure out which non-microcoded integer PPC ops the sraw decodes into and then issue them individually. This won't help with the latency of the sraw itself — it'll replace one op with six — but in between those six ops I can dual-dispatch some work to the other execution units and get a net gain.

I can't seem to find anywhere what μops sraw decodes into — does anyone know how I can replace a variable bit-shift with a sequence of constant shifts and basic integer operations? (A for loop or a switch or anything with a branch in it won't work because the branch penalty is even bigger than the microcode penalty, even for correctly-predicted branches.)

This needn't be answered in assembly; I'm hoping to learn the algorithm rather than the particular code, so an answer in C or a high level language or even pseudo code would be perfectly helpful.

Edit: A couple of clarifications that I should add:

  1. I'm not even a little bit worried about portability
  2. PPC has a conditional-move, so we can assume the existence of a branchless intrinsic function

    int isel(int a, int b, int c)  { return a >= 0 ? b : c; }
    

    (if you write out a ternary that does the same thing I'll get what you mean)

  3. integer multiplication is also microcoded and even slower than sraw. :-(
  4. On Xenon PPC, the latency of a predicted branch is 8 cycles, so even one makes it as costly as the microcoded instruction. Jump-to-pointer (any indirect branch or function pointer) is a guaranteed mispredict, a 24 cycle stall.
Winner answered 12/2, 2009 at 3:9 Comment(3)
One thing that springs to mind is Duff's Device (en.wikipedia.org/wiki/Duffs_device) with one-bit shift instructions instead. But you need one branch and then several shift instructions, so I'd guess it's slower.Monastic
@Some: The single branch penalty is greater than the microcode instruction penalty, so a Duff's Device wouldn't be an optimization.Eructate
playstation3 / cell programmer, eh?Widdershins
8

Here you go...

I decided to try these out as well, since Mike Acton claimed it would be faster than using the CELL/PS3 microcoded shift on his CellPerformance site, where he suggests avoiding the indirect shift. However, in all my tests, the microcoded version was not only faster than a full generic branch-free replacement for the indirect shift, it also takes far less code memory (1 instruction).

The only reason I did these as templates was to get the right output for both signed (usually arithmetic) and unsigned (logical) shifts.

template <typename T> FORCEINLINE T VariableShiftLeft(T nVal, int nShift)
{   // 31-bit shift capability (Rolls over at 32-bits)
    const int bMask1=-(1&nShift);
    const int bMask2=-(1&(nShift>>1));
    const int bMask3=-(1&(nShift>>2));
    const int bMask4=-(1&(nShift>>3));
    const int bMask5=-(1&(nShift>>4));
    nVal=(nVal&bMask1) + nVal;   //nVal=((nVal<<1)&bMask1) | (nVal&(~bMask1));
    nVal=((nVal<<(1<<1))&bMask2) | (nVal&(~bMask2));
    nVal=((nVal<<(1<<2))&bMask3) | (nVal&(~bMask3));
    nVal=((nVal<<(1<<3))&bMask4) | (nVal&(~bMask4));
    nVal=((nVal<<(1<<4))&bMask5) | (nVal&(~bMask5));
    return(nVal);
}
template <typename T> FORCEINLINE T VariableShiftRight(T nVal, int nShift)
{   // 31-bit shift capability (Rolls over at 32-bits)
    const int bMask1=-(1&nShift);
    const int bMask2=-(1&(nShift>>1));
    const int bMask3=-(1&(nShift>>2));
    const int bMask4=-(1&(nShift>>3));
    const int bMask5=-(1&(nShift>>4));
    nVal=((nVal>>1)&bMask1) | (nVal&(~bMask1));
    nVal=((nVal>>(1<<1))&bMask2) | (nVal&(~bMask2));
    nVal=((nVal>>(1<<2))&bMask3) | (nVal&(~bMask3));
    nVal=((nVal>>(1<<3))&bMask4) | (nVal&(~bMask4));
    nVal=((nVal>>(1<<4))&bMask5) | (nVal&(~bMask5));
    return(nVal);
}
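These can be sanity-checked off-target against the native shift. Here is a minimal standalone copy of the left-shift template (FORCEINLINE dropped so it compiles anywhere; the right-shift version checks out the same way), exercised for every shift count:

```cpp
#include <cassert>
#include <cstdint>

// Same body as VariableShiftLeft above, minus the FORCEINLINE macro.
template <typename T> T VariableShiftLeft(T nVal, int nShift)
{   // 31-bit shift capability (rolls over at 32 bits)
    const int bMask1 = -(1 & nShift);
    const int bMask2 = -(1 & (nShift >> 1));
    const int bMask3 = -(1 & (nShift >> 2));
    const int bMask4 = -(1 & (nShift >> 3));
    const int bMask5 = -(1 & (nShift >> 4));
    nVal = (nVal & bMask1) + nVal;                        // conditional shift by 1
    nVal = ((nVal << (1 << 1)) & bMask2) | (nVal & ~bMask2); // by 2
    nVal = ((nVal << (1 << 2)) & bMask3) | (nVal & ~bMask3); // by 4
    nVal = ((nVal << (1 << 3)) & bMask4) | (nVal & ~bMask4); // by 8
    nVal = ((nVal << (1 << 4)) & bMask5) | (nVal & ~bMask5); // by 16
    return nVal;
}
```

Checking all 32 shift counts against the native `<<` confirms the mask ladder composes correctly.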

EDIT: A note on isel(): I saw your isel() code on your website.

// if a >= 0, return x, else y
int isel( int a, int x, int y )
{
    int mask = a >> 31; // arithmetic shift right, splat out the sign bit
    // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
    return x + ((y - x) & mask);
}

FWIW, if you rewrite your isel() to do a mask and mask complement, it will be faster on your PowerPC target since the compiler is smart enough to generate an 'andc' opcode. It's the same number of opcodes but there is one fewer result-to-input-register dependency in the opcodes. The two mask operations can also be issued in parallel on a superscalar processor. It can be 2-3 cycles faster if everything is lined up correctly. You just need to change the return to this for the PowerPC versions:

return (x & (~mask)) + (y & mask);
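For what it's worth, the two formulations are easy to check against each other off-target. A quick equivalence sketch in plain C (this relies on `>>` of a negative int being an arithmetic shift, which is the usual behavior and holds on PowerPC):

```c
#include <assert.h>

/* Dawson-style isel: if a >= 0 return x, else y */
int isel_sub(int a, int x, int y)
{
    int mask = a >> 31;           /* 0 if a >= 0, all ones if a < 0 */
    return x + ((y - x) & mask);
}

/* mask / mask-complement form: the ~mask AND maps to 'andc' on PowerPC */
int isel_andc(int a, int x, int y)
{
    int mask = a >> 31;
    return (x & ~mask) + (y & mask);
}
```

The second form trades the `y - x` data dependency for two independent ANDs that can issue in parallel.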
Eructate answered 21/10, 2009 at 21:35 Comment(3)
Thanks! Yeah, after blundering around for a while I concluded there wasn't a way to beat the microcode here. I suppose it uses micro-ops for which there are no corresponding opcodes in the ISA. Thanks for the improved isel() -- I just used Dawson's, never even thought it could be improved on!Winner
When I first read your post, I thought you had found a magic isel() intrinsic/asm-op that I had somehow missed vs the mask function, which would have been very nice. FWIW, you can do it quite quickly on PC as well with CMOVcc asm-ops, so it's worth bearing in mind possibly having different versions of isel() on different target platforms.Eructate
Oh, and it's probably obvious but the nVal= lines are basically an isel() that has been expanded.Eructate
5

How about this:

if (y & 16) x <<= 16;
if (y & 8) x <<= 8;
if (y & 4) x <<= 4;
if (y & 2) x <<= 2;
if (y & 1) x <<= 1;

It will probably take longer to execute, but it's easier to interleave if you have other code to go between.
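Wrapped in a function for testing — these are exactly the branches the comments discuss replacing with conditional moves:

```c
/* Shift x left by y (0..31) via five conditional constant shifts.
   Each test looks at one bit of the shift amount; the constant
   shifts compose to the full variable shift. */
unsigned shift_ladder(unsigned x, unsigned y)
{
    if (y & 16) x <<= 16;
    if (y & 8)  x <<= 8;
    if (y & 4)  x <<= 4;
    if (y & 2)  x <<= 2;
    if (y & 1)  x <<= 1;
    return x;
}
```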

Periotic answered 12/2, 2009 at 3:19 Comment(4)
The question specified no branches!Aciculum
Yeah but what he's saying can be achieved with a branchless conditional-move op -- I get what he's trying to communicate.Winner
Oh, the predicated instructions are a total game-changer. Joshua is saved from a nasty downvote! How does it perform?Aciculum
I don't think that code is right. Those bit tests should be with powers of 2.Jehiah
4

Let's assume that your max shift is 31. So the shift amount is a 5-bit number. Because shifts compose — shifting by a and then by b is the same as shifting by a + b — we can break this into five constant shifts. The obvious version uses branching, but you ruled that out.

Let N be a number between 0 and 4. You want to shift x by 2^N if the bit whose value is 2^N is set in y, otherwise keep x intact. Here is one way to do it:

#define SHIFT(N) x = isel(((y >> N) & 1) - 1, x << (1 << N), x);

The macro assigns to x either x << 2^N or x, depending on whether bit N is set in y or not.

And then the driver:

SHIFT(0); SHIFT(1); SHIFT(2); SHIFT(3); SHIFT(4);

Note that N is a macro parameter and expands to a compile-time constant.

I don't know, though, whether this is actually going to be faster than the variable shift. If it were, one wonders why the microcode wouldn't run this instead...
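Putting the macro, the question's isel(), and the driver together into one compilable unit (a sketch, not tuned code — unsigned is used to keep the shifts well-defined, and the isel here assumes arithmetic right shift of negatives, as on PPC):

```c
/* branchless select: a >= 0 ? b : c */
static unsigned isel_u(int a, unsigned b, unsigned c)
{
    unsigned mask = (unsigned)(a >> 31);   /* 0 or all ones */
    return b + ((c - b) & mask);
}

/* shift x by 2^N iff bit N of y is set */
#define SHIFT(N) x = isel_u(((y >> (N)) & 1) - 1, x << (1 << (N)), x);

unsigned VarShiftLeft(unsigned x, int y)
{
    SHIFT(0) SHIFT(1) SHIFT(2) SHIFT(3) SHIFT(4)
    return x;
}
```

Each SHIFT expands to constant-shift, mask, and select ops only — exactly the kind of simple integer instructions that dual-dispatch alongside other work.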

Heatstroke answered 12/2, 2009 at 4:6 Comment(2)
That's interesting -- I'll try it out in the simulator. The microcoded op definitely works by substituting itself with some sequence of other not-microcoded ops and then running them instead; the problem is it's not pipelined, so I'm trying to figure out what that magic sequence of μops is.Winner
If you used x = isel(-(signed)(y >> N & 1), x, x << (1 << N)) you might save an extra op.Jehiah
1

This one breaks my head. I've now discarded a half dozen ideas. All of them exploit the notion that adding a thing to itself shifts left 1, doing the same to the result shifts left 2, and so on. If you keep all the partial results for shift left 0, 1, 2, 4, 8, and 16, then by testing bits 0 to 4 of the shift variable you can get your initial shift. Now do it again, once for each 1 bit in the shift variable. Frankly, you might as well send your processor out for coffee.

The one place I'd look for real help is Hank Warren's Hacker's Delight (which is the only useful part of this answer).

Aciculum answered 12/2, 2009 at 3:27 Comment(1)
Yeah, I ran into the same wall that you did. However I find the phrase "you might as well send your processor out for coffee" utterly delightful and will use it at every possible excuse henceforth. =)Winner
0

How about this:

int multiplicands[] = { 1, 2, 4, 8, 16, 32, /* ... */ };

int ShiftByVar( int x, int y )
{
    //return x << y;
    return x * multiplicands[y];
}
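Filled out and testable (the init helper is my own addition to avoid writing out all 32 powers of two — though, as noted below, integer multiply is itself microcoded on this target):

```c
#include <stdint.h>

/* The table is just the 32 powers of two, built once at startup. */
static uint32_t multiplicands[32];

void init_multiplicands(void)
{
    for (int i = 0; i < 32; ++i)
        multiplicands[i] = 1u << i;
}

uint32_t ShiftByVar(uint32_t x, uint32_t y)
{
    /* x << y, expressed as a multiply by 2^y */
    return x * multiplicands[y & 31];
}
```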
Hausmann answered 12/2, 2009 at 3:33 Comment(1)
sadly, multiply is pretty slow too. =(Winner
0

If the shift count can be calculated far in advance, then I have two ideas that might work:

  • Using self-modifying code

    Just modify the shift-amount immediate in the instruction. Alternatively, generate the code for the variable-shift functions dynamically at run time.

  • Group the values with the same shift count together if possible, and do the operation all at once using Duff's device or function pointer to minimize branch misprediction

    // shift by constant functions
    typedef int (*shiftFunc)(int);    // the shift function
    #define SHL(n) int shl##n(int x) { return x << (n); }
    SHL(1)
    SHL(2)
    SHL(3)
    ...
    shiftFunc shiftLeft[] = { shl1, shl2, shl3... };
    
    int arr[MAX];       // all the values that need to be shifted with the same amount
    shiftFunc shl = shiftLeft[2]; // when you want to shift by 3 (shl1 is at index 0)
    for (int i = 0; i < MAX; i++)
        arr[i] = shl(arr[i]);
    

    This method might also be done in combination with self-modifying or run-time code generation to remove the need for a function pointer.

    Edit: As commented, unfortunately there's no branch prediction on jump to register at all, so the only way this could work is generating code as I said above, or using SIMD
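A compilable instance of the grouping idea (the helper name is illustrative; the point is that the one costly indirect call is paid per batch, not per element):

```c
typedef int (*shiftFunc)(int);

#define SHL(n) static int shl##n(int x) { return x << (n); }
SHL(1) SHL(2) SHL(3) SHL(4)

static shiftFunc shiftLeft[] = { shl1, shl2, shl3, shl4 };

/* shift a whole batch by the same amount through one indirect call */
void shift_all(int *arr, int count, int amount)
{
    shiftFunc f = shiftLeft[amount - 1];  /* shl1 sits at index 0 */
    for (int i = 0; i < count; ++i)
        arr[i] = f(arr[i]);
}
```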


If the range of the values is small, a lookup table is another possible solution:

#define S(x, n) ((x) + 0) << (n), ((x) + 1) << (n), ((x) + 2) << (n), ((x) + 3) << (n), \
                ((x) + 4) << (n), ((x) + 5) << (n), ((x) + 6) << (n), ((x) + 7) << (n)
#define S2(x, n)    S((x + 0)*8, n), S((x + 1)*8, n), S((x + 2)*8, n), S((x + 3)*8, n), \
                    S((x + 4)*8, n), S((x + 5)*8, n), S((x + 6)*8, n), S((x + 7)*8, n)
uint8_t shl[8][256] = {
    { S2(0U, 0), S2(8U, 0), S2(16U, 0), S2(24U, 0) },
    { S2(0U, 1), S2(8U, 1), S2(16U, 1), S2(24U, 1) },
    ...
    { S2(0U, 7), S2(8U, 7), S2(16U, 7), S2(24U, 7) },
};

Now x << n is simply shl[n][x] with x being a uint8_t. The table costs 2 KB (8 × 256 B) of memory. However, for 16-bit values you'd need a 2 MB table (16 × 128 KB of uint16_t), which may still be viable, and you can do a 32-bit shift by combining two 16-bit shifts together
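Equivalently, the table can be filled at startup rather than with the macros (same layout, one row per shift count; the names here are illustrative):

```c
#include <stdint.h>

static uint8_t shl_tab[8][256];

void init_shl_tab(void)
{
    for (int n = 0; n < 8; ++n)
        for (int x = 0; x < 256; ++x)
            shl_tab[n][x] = (uint8_t)(x << n);  /* wraps at 8 bits, like the macro table */
}

/* 8-bit variable shift with no shift instruction at all */
uint8_t shl8(uint8_t x, int n)
{
    return shl_tab[n][x];
}
```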

Koger answered 25/1, 2019 at 12:1 Comment(2)
I don't see how using a case to jump into the middle of an unrolled do{}while() (Duff's Device) is helpful here. And even if it was, according to a comment on a deleted answer, the target uarch has very costly branches and no indirect-branch prediction, so function pointers and Duff's Device are total showstoppers. I'll edit that into the question.Affectation
You'd need to replicate your whole loop for each shift count, so dispatch branch cost can be outside the loop. If the target uarch has Altivec for SIMD shifts, doing a whole array of shifts is that much cheaper so this becomes closer to viable.Affectation
-1

There is some good stuff here regarding bit manipulation black magic: Advanced bit manipulation fu (Christer Ericson's blog)

Don't know if any of it's directly applicable, but if there is a way, there are likely some hints to it in there somewhere.

Milled answered 12/2, 2009 at 4:27 Comment(0)
-1

Here's something that is trivially unrollable:

int result = value;            // value is the operand, k the shift amount (0..31)
int shift_accumulator = value;

for (int i = 0; i < 5; ++i)
{
    result += shift_accumulator & (-(k & 1)); // replace with isel if appropriate
    shift_accumulator += shift_accumulator;   // doubling: shifts left by one each pass
    k >>= 1;
}
Jehiah answered 24/8, 2009 at 17:34 Comment(1)
Shouldn't that be k-=1? And this looks like you're multiplying by a 5-bit k, not by a 32-bit 1<<k. e.g. for k=31, how is this going to get the low bit to the MSB?Affectation
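For reference, a corrected sketch in the same unrollable shape, addressing the objection above — select with a mask at each step instead of accumulating sums, so the result really is value << k:

```c
/* value << k for k in 0..31, as five mask-selected constant shifts */
unsigned shl_var(unsigned value, unsigned k)
{
    unsigned result = value;
    for (int i = 0; i < 5; ++i) {
        unsigned mask = 0u - (k & 1u);  /* all ones if this bit of k is set */
        result = ((result << (1u << i)) & mask) | (result & ~mask);
        k >>= 1;
    }
    return result;
}
```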
