Faster way for extracting and combining bits from UINT16 to UINT8

Asked 9/11, 2020 at 9:36 Answered 9/11, 2020 at 10:53

Solved c algorithm assembly optimization dspic

I'm searching for a faster way for my required special extract and combine operation as described below:

+-------+-------+-------+-------+-------+-------+-------+-------+
| BIT 7 | BIT 6 | BIT 5 | BIT 4 | BIT 3 | BIT 2 | BIT 1 | BIT 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   D1  |  D0   |  C1   |  C0   |  B1   |  B0   |  A1   |   A0  |
+-------+-------+-------+-------+-------+-------+-------+-------+

A = A0 OR A1
B = B0 OR B1
C = C0 OR C1
D = D0 OR D1

+-------+-------+-------+-------+-------+-------+-------+-------+
| BIT 7 | BIT 6 | BIT 5 | BIT 4 | BIT 3 | BIT 2 | BIT 1 | BIT 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |   D   |   C   |   B   |   A   |
+-------+-------+-------+-------+-------+-------+-------+-------+

For sake of simplicity above is only an 8-bit example, the same applies for 16 bit values. It should be implemented as fast as possible on dsPIC33F microcontroller.

The easy way in C is:

PairFlags |= (ChannelFlags & 0x0003) ? 0x0001 : 0;
PairFlags |= (ChannelFlags & 0x000C) ? 0x0002 : 0;
PairFlags |= (ChannelFlags & 0x0030) ? 0x0004 : 0;
PairFlags |= (ChannelFlags & 0x00C0) ? 0x0008 : 0;
PairFlags |= (ChannelFlags & 0x0300) ? 0x0010 : 0;
PairFlags |= (ChannelFlags & 0x0C00) ? 0x0020 : 0;
PairFlags |= (ChannelFlags & 0x3000) ? 0x0040 : 0;
PairFlags |= (ChannelFlags & 0xC000) ? 0x0080 : 0;

This will produce approx. 40 instructions (with O3) which corresponds to 1µs in my case.

The amount of instruction cycles should be reduced if possible. Is there a faster way either in C or inline assembly?

Katowice answered 9/11, 2020 at 9:36 Comment(10)

Is the number of instructions or the number of branches the main performance concern? – Kokura 9/11, 2020 at 10:10

@Kokura The number of instruction cycles are important – Katowice 9/11, 2020 at 10:14

I would assume that a dsPIC has all manner of fancy branch prediction though? – Kokura 9/11, 2020 at 10:28

@Kokura I've never heard that dsPIC33F has implemented any fancy branch prediction algorithmns. – Katowice 9/11, 2020 at 10:36

Not sure if it's a competitive solution in terms of performance, but you could do this with a simple table lookup - no asm needed. – Auspice 9/11, 2020 at 10:43

@500-InternalServerError I'm starting to think that would be the best solution too. – Kokura 9/11, 2020 at 10:44

A lookup table for entire source word (keep in mind we are talking about 16 bits here) or per 2-bits nibble? Later would just rebuild "OR" instruction in software as LUT. – Katowice 9/11, 2020 at 10:46

You can probably make a 256 byte table for 8 bit version and then call that per byte in the 16 bit version? – Kokura 9/11, 2020 at 10:56

Hmm... would a look-up table in flash cause any Harvard architecture hiccups on this part? It would ideally be 256 bytes large. – Kokura 9/11, 2020 at 10:59

@Lundin: Indeed, using the first step of Ian's answer, another shift/OR/truncate-to-8bit leaves you with just an 8-bit bit-shuffle problem, see my comment. If a 256-byte LUT is good, that would be the way to go. – Vansickle 16/11, 2020 at 18:14

The following should work for reducing a 16-bit value to 8 bits (with each bit of output formed by ORing a pair of bits of input):

// Set even bits to bits in pair ORed together, and odd bits to 0...
PairFlags = (ChannelFlags | (ChannelFlags >> 1)) & 0x5555; // '0h0g0f0e0d0c0b0a'
// Compress the '00' or '01' bit pairs down to single '0' or '1' bits...
PairFlags = (PairFlags ^ (PairFlags >> 1)) & 0x3333; // '00hg00fe00dc00ba'
PairFlags = (PairFlags ^ (PairFlags >> 2)) & 0x0F0F; // '0000hgfe0000dcba'
PairFlags = (PairFlags ^ (PairFlags >> 4)) & 0x00FF; // '00000000hgfedcba'

Note: The ^ can be replaced by | in the above for the same result.

Shirberg answered 9/11, 2020 at 10:53 Comment(3)

This is exactly what I was looking for, haven't tested it yet, however it was compiled to only 15 branch free instructions. – Katowice 9/11, 2020 at 11:5

@bkausbk: If lookup tables are efficient, use the first step of this answer, then turn 0h0g0f0e0d0c0b0a into hdgcfbea by doing PairFlags |= PairFlags >> 7 and taking the low byte. ((uint8_t) or & 0xFF). Then a 256 x 8-bit LUT can do the bit-shuffle to give the bits in desired order. On a modern x86 CPU, 3 more shift/xor/and steps would probably be faster than a lookup table (except maybe with clever use of SSSE3 pshufb for dbca -> dcba nibble lookups), but if a load is guaranteed to only take a few cycles (no cache miss possible) and 256B of table space is cheap, try it. – Vansickle 16/11, 2020 at 18:11

Of course, on recent Intel CPUs (which have fast BMI2 pext, unlike recent AMD where it's slow uops.info), you'd only need _pext_u32( ChannelFlags | (ChannelFlags << 1), 0xAAAA ). Like 3 asm instructions (lea/OR/pext), or 4 including a mov-immediate to set up the constant. Left shift instead of right allows it to be done using lea to avoid destroying the source operand, instead of mov + shl. – Vansickle 16/11, 2020 at 19:0

Assuming I got everything right (not tested), this seems to generate good, branch-free code at least on gcc and clang for x86 (-O3):

uint8_t convert (uint8_t ChannelFlags)
{
  return ( ((ChannelFlags & A1A0)!=0) << A_POS ) |
         ( ((ChannelFlags & B1B0)!=0) << B_POS ) |
         ( ((ChannelFlags & C1C0)!=0) << C_POS ) |
         ( ((ChannelFlags & D1D0)!=0) << D_POS ) ;  
}

This masks out each individual bitset, then check against zero to end up with 1 or 0 in a temporary int. This value is shifted in position in the result, before everything is finally bitwise OR:ed together. Full code:

#include <stdint.h>

#define A1A0  (3u << 0)
#define B1B0  (3u << 2)
#define C1C0  (3u << 4)
#define D1D0  (3u << 6)

#define A_POS 0
#define B_POS 1
#define C_POS 2
#define D_POS 3

uint8_t convert (uint8_t ChannelFlags)
{
  return ( ((ChannelFlags & A1A0)!=0) << A_POS ) |
         ( ((ChannelFlags & B1B0)!=0) << B_POS ) |
         ( ((ChannelFlags & C1C0)!=0) << C_POS ) |
         ( ((ChannelFlags & D1D0)!=0) << D_POS ) ;  
}

clang disassembly x86 gives 18 instructions branch free:

convert:                                # @convert
        test    dil, 3
        setne   al
        test    dil, 12
        setne   cl
        add     cl, cl
        or      cl, al
        test    dil, 48
        setne   al
        shl     al, 2
        or      al, cl
        mov     ecx, edi
        shr     cl, 7
        shr     dil, 6
        and     dil, 1
        or      dil, cl
        shl     dil, 3
        or      al, dil
        ret

Kokura answered 9/11, 2020 at 10:17 Comment(3)

I assume you ment #define A_POS (0), #define B_POS (1) ... In any case it boils down to exactly my given C way which unfortunately is not fast (considering 16 bits is converted). – Katowice 9/11, 2020 at 10:33

@Katowice Oh right, that's a bug. Hang on, I'll edit. – Kokura 9/11, 2020 at 10:35

@Katowice Fixed. Turns out the second version I posted gave better machine code when the bit masks were corrected. – Kokura 9/11, 2020 at 10:39

Not sure if more efficient but instead of using a ternary if, why not use only bitwise operations ? And just offset it with the bitshift operator

PairFlags = ((ChannelFlags & (0b1 << 0)) | (ChannelFlags & (0b10 << 0))) << 0;
PairFlags = ((ChannelFlags & (0b1 << 2)) | (ChannelFlags & (0b10 << 2))) << 1;
PairFlags = ((ChannelFlags & (0b1 << 4)) | (ChannelFlags & (0b10 << 4))) << 2;
//...

Stagecraft answered 9/11, 2020 at 9:57 Comment(0)

Here is an idea. Observe one thing here:

A = A0 OR A1
B = B0 OR B1
C = C0 OR C1
D = D0 OR D1

You have 4 or operations. You can perform all of them in 1 instruction:

PairFlags = (PairFlags | (PairFlags >> 1))

Now you bits are aligned like that:

[D1][D1 or D0][D0 or C1][C1 or C0][C0 or B1][B1 or B0][B0 or A1][A1 or A0]

So you just need to extract bits 0, 2, 4, 6 to get the result.

Bit 0. Is already OK.

Bit 1 should be set to bit 2.

Bit 2 should be set to bit 4.

Bit 3 should be set to bit 6.

Final code something like that:

PairFlags = (PairFlags | (PairFlags >> 1))
PairFlags = (PairFlags&1) | ((PairFlags&4)>>1) | ((PairFlags&16)>>2) | ((PairFlags&64)>>3)

Lacuna answered 9/11, 2020 at 10:18 Comment(1)

Indeed a clever method. It compiles to slightly less instructions (26) and is branch free. – Katowice 9/11, 2020 at 11:1

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags