Bit hack: Expanding bits
Asked Answered
R

9

14

I am trying to convert a uint16_t input to a uint32_t bit mask. One bit in the input toggles two bits in the output bit mask. Here is an example converting a 4-bit input to an 8-bit bit mask:

Input    Output
ABCDb -> AABB CCDDb

A,B,C,D are individual bits

Example outputs:

0000b -> 0000 0000b
0001b -> 0000 0011b
0010b -> 0000 1100b
0011b -> 0000 1111b
....
1100b -> 1111 0000b
1101b -> 1111 0011b
1110b -> 1111 1100b
1111b -> 1111 1111b

Is there a bithack-y way to achieve this behavior?

Roentgenotherapy answered 10/8, 2016 at 19:22 Comment(12)
Do you allow the use of pdep?Vaporous
Sean Eron Anderson has this in his bit twiddling hacks: graphics.stanford.edu/~seander/… - also the three ones below.Cordate
@harold No, sorry. Straight-up C.Roentgenotherapy
So no _pdep_u32 intrinsic either?Vaporous
What non-"bithack-y way" did you try? Why does that not work/is too complex/too slow/etc.?Stepp
@harold: That's not standard. What is pdep??Stepp
@harold No intrinsics. Trying to keep this portable.Roentgenotherapy
@Olaf it scatters bits according to a mask, putting the lsb at the position of the lowest set bit in the mask and so on. This problem could be solved with a pdep and a multiplication by 3.Vaporous
@harold: Well, it does not exist on my targets. You seem to assume x86 is ISO-given. How do you know OP uses x86?Stepp
@Olaf I don't, that's why I asked whether he allows it.Vaporous
My gut-feeling is that there is a multiplication-trick possible here. BTW: Stanford bit hacks does have bit-interleaving, IIRC.Cayenne
@wildplasser: sure they have: graphics.stanford.edu/~seander/bithacks.html#Interleave64bitOps - but God is this hackyCordate
R
10

Interleaving bits by Binary Magic Numbers contained the clue:

uint32_t expand_bits(uint16_t bits)
{
    uint32_t x = bits;

    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;

    return x | (x << 1);
}

The first four steps consecutively interleave the source bits in groups of 8, 4, 2, 1 bits with zero bits, resulting in 00AB00CD after the first step, 0A0B0C0D after the second step, and so on. The last step then duplicates each even bit (containing an original source bit) into the neighboring odd bit, thereby achieving the desired bit arrangement.

A number of variants are possible. The last step can also be coded as x + (x << 1) or 3 * x. The | operators in the first four steps can be replaced by ^ operators. The masks can also be modified as some bits are naturally zero and don't need to be cleared. On some processors short masks may be incorporated into machine instructions as immediates, reducing the effort for constructing and / or loading the mask constants. It may also be advantageous to increase instruction-level parallelism for out-of-order processors and optimize for those with shift-add or integer-multiply-add instructions. One code variant incorporating various of these ideas is:

uint32_t expand_bits (uint16_t bits)
{
    uint32_t x = bits;

    x = (x ^ (x << 8)) & ~0x0000FF00;
    x = (x ^ (x << 4)) & ~0x00F000F0;
    x = x ^ (x << 2);
    x = ((x & 0x22222222) << 1) + (x & 0x11111111);
    x = (x << 1) + x;

    return x;
}
Roentgenotherapy answered 10/8, 2016 at 19:22 Comment(2)
Can you explain how this works? It's quite short, but it's not immediately clear what's going on.Zelikow
x = (x << 1) + x; can be reduced to x *= 3; Proof: ((x <<1) + x ) ~= (x * 2 + x) ~= x * 3.Backstay
D
7

The easiest way to map a 4-bit input to an 8-bit output is with a 16 entry table. So then it's just a matter of extracting 4 bits at a time from the uint16_t, doing a table lookup, and inserting the 8-bit value into the output.

uint32_t expandBits( uint16_t input )
{
    uint32_t table[16] = {
        0x00, 0x03, 0x0c, 0x0f,
        0x30, 0x33, 0x3c, 0x3f,
        0xc0, 0xc3, 0xcc, 0xcf,
        0xf0, 0xf3, 0xfc, 0xff
    };

    uint32_t output;
    output  = table[(input >> 12) & 0xf] << 24;
    output |= table[(input >>  8) & 0xf] << 16;
    output |= table[(input >>  4) & 0xf] <<  8;
    output |= table[ input        & 0xf];
    return output;
}

This provides a decent compromise between performance and readability. It doesn't have quite the performance of cmaster's over-the-top lookup solution, but it's certainly more understandable than thndrwrks' magical mystery solution. As such, it provides a technique that can be applied to a much larger variety of problems, i.e. use a small lookup table to solve a larger problem.

Desrosiers answered 10/8, 2016 at 21:2 Comment(2)
making the table static const should improve performance.Glorious
@Glorious Interesting suggestion. I tried it with clang, and it makes no difference. The compiler already puts the table in the text segment, even without the static const. It's able to do that because, (1) table has local scope (2) table is never used as an lvalue (3) there are no pointers in the function, so nothing can point to table.Desrosiers
B
4

In case you want to get some estimate of relative speeds, some community wiki test code. Adjust as needed.

void f_cmp(uint32_t (*f1)(uint16_t x), uint32_t (*f2)(uint16_t x)) {
  uint16_t x = 0;
  do {
    uint32_t y1 = (*f1)(x);
    uint32_t y2 = (*f2)(x);
    if (y1 != y2) {
      printf("%4x %8lX %8lX\n", x, (unsigned long) y1, (unsigned long) y2);
    }
  } while (x++ != 0xFFFF);
}

void f_time(uint32_t (*f1)(uint16_t x)) {
  f_cmp(expand_bits, f1);
  clock_t t1 = clock();
  volatile uint32_t y1 = 0;
  unsigned n = 1000;
  for (unsigned i = 0; i < n; i++) {
    uint16_t x = 0;
    do {
      y1 += (*f1)(x);
    } while (x++ != 0xFFFF);
  }
  clock_t t2 = clock();
  printf("%6llu %6llu: %.6f %lX\n", (unsigned long long) t1,
          (unsigned long long) t2, 1.0 * (t2 - t1) / CLOCKS_PER_SEC / n,
          (unsigned long) y1);
  fflush(stdout);
}

int main(void) {
  f_time(expand_bits);
  f_time(expandBits);
  f_time(remask);
  f_time(javey);
  f_time(thndrwrks_expand);
  // now in the other order
  f_time(thndrwrks_expand);
  f_time(javey);
  f_time(remask);
  f_time(expandBits);
  f_time(expand_bits);
  return 0;
}

Results

     0    280: 0.000280 FE0C0000 // fast
   280    702: 0.000422 FE0C0000
   702   1872: 0.001170 FE0C0000
  1872   3026: 0.001154 FE0C0000
  3026   4399: 0.001373 FE0C0000 // slow

  4399   5740: 0.001341 FE0C0000
  5740   6879: 0.001139 FE0C0000
  6879   8034: 0.001155 FE0C0000
  8034   8470: 0.000436 FE0C0000
  8486   8751: 0.000265 FE0C0000
Backwards answered 10/8, 2016 at 19:22 Comment(0)
G
3

Here's a working implementation:

uint32_t remask(uint16_t x)
{
    uint32_t i;
    uint32_t result = 0;
    for (i=0;i<16;i++) {
        uint32_t mask = (uint32_t)x & (1U << i);
        result |= mask << (i);
        result |= mask << (i+1);
    }
    return result;
}

On each iteration of the loop, the bit in question from the uint16_t is masked out and stored.

That bit is then shifted by its bit position and ORed into the result, then shifted again by its bit position plus 1 and ORed into the result.

Godart answered 10/8, 2016 at 20:18 Comment(0)
M
3

A simple loop. Maybe not bit-hacky enough?

uint32_t thndrwrks_expand(uint16_t x) {
  uint32_t mask = 3;
  uint32_t y = 0;
  while (x) {
    if (x&1) y |= mask;
    x >>= 1;
    mask <<= 2;
  }
  return y;
}

Tried another that is twice as fast, though still 655/272 as slow as expand_bits(). It appears to be the fastest of the 16-iteration loop solutions.

uint32_t thndrwrks_expand(uint16_t x) {
  uint32_t y = 0;
  for (uint16_t mask = 0x8000; mask; mask >>= 1) {
    y <<= 1;
    y |= x&mask;
  }
  y *= 3;
  return y;
}
Merylmes answered 10/8, 2016 at 20:30 Comment(0)
U
3

If your concern is performance and simplicity, you are likely best off with a big lookup table (64k entries of 4 bytes each, i.e. 256 kiB). With that, you can use pretty much any algorithm you like to generate the table; the lookup itself is just a single memory access.

If that table is too big for your liking, you can split it. For instance, you can use an 8-bit lookup table with 256 entries of 2 bytes each. With that you can perform the entire operation with just two lookups. As a bonus, this approach allows for type-punning tricks to avoid the hassle of splitting the address with bit operations:

//Implementation defined behavior ahead:
//Works correctly for both little and big endian machines,
//however, results will be wrong on a PDP11...
uint32_t getMask(uint16_t input) {
    assert(sizeof(uint16_t) == 2);
    assert(sizeof(uint32_t) == 4);
    static const uint16_t lookupTable[256] = { 0x0000, 0x0003, 0x000c, 0x000f, ... };

    unsigned char* inputBytes = (unsigned char*)&input;    //legal because we type-pun to char, but the order of the bytes is implementation defined
    char outputBytes[4];
    uint16_t* outputShorts = (uint16_t*)outputBytes;    //legal because we type-pun from char, but the order of the shorts is implementation defined
    outputShorts[0] = lookupTable[inputBytes[0]];
    outputShorts[1] = lookupTable[inputBytes[1]];
    uint32_t output;
    memcpy(&output, outputBytes, 4);    //can't type-pun directly from uint16_t to uint32_t due to strict aliasing rules
    return output;
}

The code above works around the strict aliasing rules by casting only to/from char, which is an explicit exception to those rules. It also works around the effects of little/big-endian byte order by building the result in the same order as the input was split. However, it still exposes implementation-defined behavior: a machine with a byte order of 1, 0, 3, 2, or some other middle-endian order, will silently produce wrong results (such CPUs have actually existed, e.g. the PDP-11...).

Of course, you can split the lookup table even further, but I doubt that would do you any good.

Unclasp answered 10/8, 2016 at 20:51 Comment(10)
With the possibility of cache misses etc, a table is likely to be slower than the bit-twiddling hack solution. I'm not sure how you deduce that the middle-endian orders will produce the wrong result — numerically, the result will be correct (as a 32-bit value), but if you analyze bytes in sequence, then you would get different results, but that would also be true of big-endian vs little-endian systems too.Rigobertorigor
1) Sneaky endian agnostic UV. 2) Would seem more consistent to use uint8_t--> uint8_t outputBytes[4]; 3) suggest uint32 --> uint32_tMerylmes
@Backwards Yes, I thought about using uint8_t. I decided against it to make it clear that I'm actually using a char type, which has that exception in the strict aliasing rules. AFAIK, while the standard requires sizeof(char) to be 1, it does not require char to be exactly eight bit, nor does it require sizeof(uint8_t) to be 1. Consequently, I doubt that I have the license of the standard to assume that uint8_t is the same as unsigned char, and by consequence that the strict aliasing exception applies to uint8_t.Unclasp
@Backwards Of course, I meant uint32_t, not uint32. Fixed it.Unclasp
If uint8_t exist, then sizeof(uint8_t) must be 1 and unsigned char is the same size/range as uint8_t. Yet I think that assertion would engender a language lawyer session. (BTW: answer still has uint32 output;)Merylmes
@JonathanLeffler The difference between the large lookup table and any bit-fiddling solution is, that the lookup table only requires a single operation, while the bit-fiddling takes quite a number of operations. Even thndrwrks' beautiful solution (they've got my upvote) requires 14 operations, which is slower than a cache hitting memory load. Considering the size of the lookup table, we see that it's 256 kiB, which does fit into caches without any problem. However, it does increase the load on the caches, so thndrwrks' solution may indeed be faster.Unclasp
@Backwards Ah, good to know, thanks. Also thanks for spotting that missed uint32 :-)Unclasp
@JonathanLeffler On the PDP11, you would get the high order byte of the input in inputBytes[0], but the result of the lookup would be stored in outputBytes[0] which corresponds to the two low order bytes of the output. The other way round for the low order input byte. In effect, the code would produce a correct big endian value in the outputBytes array, which would then be interpreted in PDP11 middle endian byte order.Unclasp
I'd prefer using bitshifts to combine ((uint32_t)a)<<16|b and split ((uint8_t)x and x >> 8) the two halves instead of using memcopies.Doc
The char vs uint8_t thing is tricky. Since char has at least 8 bits, and uint8_t exactly 8 (if it exists), its size must be 1. But that still doesn't make it a character type and thus still not exempt from the aliasing rules. I've seen a GCC bugtracker thread where somebody proposed taking advantage of the rule that uint8_t must not alias other types.Doc
S
2

Try this, where input16 is the uint16_t input mask:

uint32_t input32 = (uint32_t) input16;
uint32_t result = 0;
uint32_t i;
for(i=0; i<16; i++)
{
    uint32_t bit_at_i = (input32 & (((uint32_t)1) << i)) >> i;
    result |= ((bit_at_i << (i*2)) | (bit_at_i << ((i*2)+1)));
}
// result is now the 32 bit expanded mask
Screenplay answered 10/8, 2016 at 20:2 Comment(9)
That code does invoke undefined behaviour in at least one expression (more on certain platforms)!Stepp
Read the standard - section bitshifts.Stepp
That's not an explanation. Can you point out a specific expression of undefined behavior?Screenplay
I suspect that this has to do with bit shifts on a signed integer. @Olaf, could you confirm this?Zelikow
Using uint8_t for bit_at_i is risky — it gets converted to (signed) int in the computation for result. Use uint32_t instead; it avoids problems. You could use 1U instead of 1 in the masking operation too, for similar reasons, though there are few systems left with 16-bit plain int.Rigobertorigor
@JonathanLeffler: 1U will not be sufficient for 16 bit architectures.Stepp
@Olaf: Oh, why not?Rigobertorigor
"That's not an explanation." - It is not. But it is a clear hint where to look for the explanation. Apparently you did not even try to follow my advice. The standard is the only authoritative resource when it comes to C issues like UB. Using it is a good idea and this is a very easy entry to get familar with it.Stepp
@JonathanLeffler: Sorry, I was after the variable. Anyway, still far by most targets have in fact 16 bit int. Still something like 8051 is sold more than most ARM. Add PIC16, AVR, the various HC08/05/12 derivats and the many other 8 or 16 bit MCUs I just don't remember. Then DSPs, etc.Stepp
A
1

My solution is meant to run on mainstream x86 PCs and be simple and generic. I did not write this to compete for the fastest and/or shortest implementation. It is just another way to solve the problem submitted by OP.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define BITS_TO_EXPAND (4U)
#define BUFF_SIZE (256U)    /* was SIZE_MAX, which collides with the macro in <stdint.h> */

static bool expand_uint(unsigned int *toexpand,unsigned int *expanded);

int main(void)
{
    unsigned int in = 12;
    unsigned int out = 0;
    bool success;
    char buff[BUFF_SIZE];

    success = expand_uint(&in,&out);
    if(false == success)
    {
        (void) puts("Error: expand_uint failed");
        return EXIT_FAILURE;
    }
    (void) snprintf(buff, (size_t) BUFF_SIZE,"%u expanded is %u\n",in,out);
    (void) fputs(buff,stdout);
    return EXIT_SUCCESS;
}
/*
** It expands an unsigned int so that every bit in a nibble is copied twice
** in the resultant number. It returns true on success, false otherwise.
*/
static bool expand_uint(unsigned int *toexpand,unsigned int *expanded)
{
    unsigned int i;
    unsigned int shifts = 0;
    unsigned int mask;

    if(NULL == toexpand || NULL == expanded)
    {
        return false;
    }
    *expanded = 0;
    for(i = 0; i < BITS_TO_EXPAND; i++)
    {
        mask = (*toexpand >> i) & 1;
        *expanded |= (mask << shifts);
        ++shifts;
        *expanded |= (mask << shifts);
        ++shifts;
    }
    return true;
}
Antinode answered 19/8, 2016 at 13:37 Comment(0)
A
0

I am a bit late to the party, but here is a generic algorithm to expand bitmasks. I needed to expand an 8-bit mask to a 64-bit integer... Hopefully someone will find this algorithm useful.

It's probably not the fastest, but it works for arbitrary input widths and expansion factors, within the limits of the hardware word size.

unsigned long expand_bitmask(unsigned long mask, unsigned nbits_in, unsigned expand_by)
{
    unsigned long result, mask_out;
    unsigned i, shift;

    assert(nbits_in * expand_by <= 8 * sizeof(unsigned long));

    // mask input
    mask &= (unsigned long)(-1) >> ((8 * sizeof(unsigned long)) - nbits_in);

    result = 0;     // holds results
    mask_out = 0;

    for (i = 0, shift = 0; i < nbits_in; ++i, shift += expand_by)
    {
        result   |= (mask << (shift - i));  // the shift differential places bit i
                                            // at position i * expand_by, since
                                            // shift - i == i * (expand_by - 1)
        mask_out |= (1UL << shift);         // this will mask the shifted bits we
                                            // want to keep, at positions i * expand_by
    }

    result &= mask_out;   // wipe out the garbage bits

    // multiply by a mask with expand_by low bits set to duplicate each kept bit.
    result *= (unsigned long)(-1) >> ((8 * sizeof(unsigned long)) - expand_by);
    return result;
}

Of course, since you'll usually know the number of bits in and out, the algorithm lets you precompute the shifts, the clean-up mask, and the final multiplier, for a fairly fast compute time: one shift per input bit, plus one AND and one multiplication in total.

For the original question this gives:

input_mask &= 0b11;
result = input_mask;
result |= (input_mask << 1);         // 1 * (2 bits out - 1)
result = (result & 0b0101) * 0b0011; // mask has bits set every 2 bits,
                                     // multiplied by 2 set bits for expansion
Adventuress answered 17/11, 2023 at 12:31 Comment(0)