Faster way for extracting and combining bits from UINT16 to UINT8
Asked Answered
K

4

9

I'm searching for a faster way for my required special extract and combine operation as described below:

+-------+-------+-------+-------+-------+-------+-------+-------+
| BIT 7 | BIT 6 | BIT 5 | BIT 4 | BIT 3 | BIT 2 | BIT 1 | BIT 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   D1  |  D0   |  C1   |  C0   |  B1   |  B0   |  A1   |   A0  |
+-------+-------+-------+-------+-------+-------+-------+-------+

A = A0 OR A1
B = B0 OR B1
C = C0 OR C1
D = D0 OR D1

+-------+-------+-------+-------+-------+-------+-------+-------+
| BIT 7 | BIT 6 | BIT 5 | BIT 4 | BIT 3 | BIT 2 | BIT 1 | BIT 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |   D   |   C   |   B   |   A   |
+-------+-------+-------+-------+-------+-------+-------+-------+

For sake of simplicity above is only an 8-bit example, the same applies for 16 bit values. It should be implemented as fast as possible on dsPIC33F microcontroller.

The easy way in C is:

PairFlags |= (ChannelFlags & 0x0003) ? 0x0001 : 0;
PairFlags |= (ChannelFlags & 0x000C) ? 0x0002 : 0;
PairFlags |= (ChannelFlags & 0x0030) ? 0x0004 : 0;
PairFlags |= (ChannelFlags & 0x00C0) ? 0x0008 : 0;
PairFlags |= (ChannelFlags & 0x0300) ? 0x0010 : 0;
PairFlags |= (ChannelFlags & 0x0C00) ? 0x0020 : 0;
PairFlags |= (ChannelFlags & 0x3000) ? 0x0040 : 0;
PairFlags |= (ChannelFlags & 0xC000) ? 0x0080 : 0;

This will produce approx. 40 instructions (with O3) which corresponds to 1µs in my case.

The amount of instruction cycles should be reduced if possible. Is there a faster way either in C or inline assembly?

Katowice answered 9/11, 2020 at 9:36 Comment(10)
Is the number of instructions or the number of branches the main performance concern?Kokura
@Kokura The number of instruction cycles are importantKatowice
I would assume that a dsPIC has all manner of fancy branch prediction though?Kokura
@Kokura I've never heard that dsPIC33F has implemented any fancy branch prediction algorithmns.Katowice
Not sure if it's a competitive solution in terms of performance, but you could do this with a simple table lookup - no asm needed.Auspice
@500-InternalServerError I'm starting to think that would be the best solution too.Kokura
A lookup table for entire source word (keep in mind we are talking about 16 bits here) or per 2-bits nibble? Later would just rebuild "OR" instruction in software as LUT.Katowice
You can probably make a 256 byte table for 8 bit version and then call that per byte in the 16 bit version?Kokura
Hmm... would a look-up table in flash cause any Harvard architecture hiccups on this part? It would ideally be 256 bytes large.Kokura
@Lundin: Indeed, using the first step of Ian's answer, another shift/OR/truncate-to-8bit leaves you with just an 8-bit bit-shuffle problem, see my comment. If a 256-byte LUT is good, that would be the way to go.Vansickle
S
9

The following should work for reducing a 16-bit value to 8 bits (with each bit of output formed by ORing a pair of bits of input):

// Set even bits to bits in pair ORed together, and odd bits to 0...
PairFlags = (ChannelFlags | (ChannelFlags >> 1)) & 0x5555; // '0h0g0f0e0d0c0b0a'
// Compress the '00' or '01' bit pairs down to single '0' or '1' bits...
PairFlags = (PairFlags ^ (PairFlags >> 1)) & 0x3333; // '00hg00fe00dc00ba'
PairFlags = (PairFlags ^ (PairFlags >> 2)) & 0x0F0F; // '0000hgfe0000dcba'
PairFlags = (PairFlags ^ (PairFlags >> 4)) & 0x00FF; // '00000000hgfedcba'

Note: The ^ can be replaced by | in the above for the same result.

Shirberg answered 9/11, 2020 at 10:53 Comment(3)
This is exactly what I was looking for, haven't tested it yet, however it was compiled to only 15 branch free instructions.Katowice
@bkausbk: If lookup tables are efficient, use the first step of this answer, then turn 0h0g0f0e0d0c0b0a into hdgcfbea by doing PairFlags |= PairFlags >> 7 and taking the low byte. ((uint8_t) or & 0xFF). Then a 256 x 8-bit LUT can do the bit-shuffle to give the bits in desired order. On a modern x86 CPU, 3 more shift/xor/and steps would probably be faster than a lookup table (except maybe with clever use of SSSE3 pshufb for dbca -> dcba nibble lookups), but if a load is guaranteed to only take a few cycles (no cache miss possible) and 256B of table space is cheap, try it.Vansickle
Of course, on recent Intel CPUs (which have fast BMI2 pext, unlike recent AMD where it's slow uops.info), you'd only need _pext_u32( ChannelFlags | (ChannelFlags << 1), 0xAAAA ). Like 3 asm instructions (lea/OR/pext), or 4 including a mov-immediate to set up the constant. Left shift instead of right allows it to be done using lea to avoid destroying the source operand, instead of mov + shl.Vansickle
K
7

Assuming I got everything right (not tested), this seems to generate good, branch-free code at least on gcc and clang for x86 (-O3):

uint8_t convert (uint8_t ChannelFlags)
{
  return ( ((ChannelFlags & A1A0)!=0) << A_POS ) |
         ( ((ChannelFlags & B1B0)!=0) << B_POS ) |
         ( ((ChannelFlags & C1C0)!=0) << C_POS ) |
         ( ((ChannelFlags & D1D0)!=0) << D_POS ) ;  
}

This masks out each individual bitset, then check against zero to end up with 1 or 0 in a temporary int. This value is shifted in position in the result, before everything is finally bitwise OR:ed together. Full code:

#include <stdint.h>

#define A1A0  (3u << 0)
#define B1B0  (3u << 2)
#define C1C0  (3u << 4)
#define D1D0  (3u << 6)

#define A_POS 0
#define B_POS 1
#define C_POS 2
#define D_POS 3

uint8_t convert (uint8_t ChannelFlags)
{
  return ( ((ChannelFlags & A1A0)!=0) << A_POS ) |
         ( ((ChannelFlags & B1B0)!=0) << B_POS ) |
         ( ((ChannelFlags & C1C0)!=0) << C_POS ) |
         ( ((ChannelFlags & D1D0)!=0) << D_POS ) ;  
}

clang disassembly x86 gives 18 instructions branch free:

convert:                                # @convert
        test    dil, 3
        setne   al
        test    dil, 12
        setne   cl
        add     cl, cl
        or      cl, al
        test    dil, 48
        setne   al
        shl     al, 2
        or      al, cl
        mov     ecx, edi
        shr     cl, 7
        shr     dil, 6
        and     dil, 1
        or      dil, cl
        shl     dil, 3
        or      al, dil
        ret
Kokura answered 9/11, 2020 at 10:17 Comment(3)
I assume you ment #define A_POS (0), #define B_POS (1) ... In any case it boils down to exactly my given C way which unfortunately is not fast (considering 16 bits is converted).Katowice
@Katowice Oh right, that's a bug. Hang on, I'll edit.Kokura
@Katowice Fixed. Turns out the second version I posted gave better machine code when the bit masks were corrected.Kokura
S
4

Not sure if more efficient but instead of using a ternary if, why not use only bitwise operations ? And just offset it with the bitshift operator

PairFlags = ((ChannelFlags & (0b1 << 0)) | (ChannelFlags & (0b10 << 0))) << 0;
PairFlags = ((ChannelFlags & (0b1 << 2)) | (ChannelFlags & (0b10 << 2))) << 1;
PairFlags = ((ChannelFlags & (0b1 << 4)) | (ChannelFlags & (0b10 << 4))) << 2;
//...
Stagecraft answered 9/11, 2020 at 9:57 Comment(0)
L
3

Here is an idea. Observe one thing here:

A = A0 OR A1
B = B0 OR B1
C = C0 OR C1
D = D0 OR D1

You have 4 or operations. You can perform all of them in 1 instruction:

PairFlags = (PairFlags | (PairFlags >> 1))

Now you bits are aligned like that:

[D1][D1 or D0][D0 or C1][C1 or C0][C0 or B1][B1 or B0][B0 or A1][A1 or A0]

So you just need to extract bits 0, 2, 4, 6 to get the result.

Bit 0. Is already OK.

Bit 1 should be set to bit 2.

Bit 2 should be set to bit 4.

Bit 3 should be set to bit 6.

Final code something like that:

PairFlags = (PairFlags | (PairFlags >> 1))
PairFlags = (PairFlags&1) | ((PairFlags&4)>>1) | ((PairFlags&16)>>2) | ((PairFlags&64)>>3)
Lacuna answered 9/11, 2020 at 10:18 Comment(1)
Indeed a clever method. It compiles to slightly less instructions (26) and is branch free.Katowice

© 2022 - 2024 — McMap. All rights reserved.