Determine which single bit in the byte is set

Asked 20/1, 2013 at 21:41 Answered 20/1, 2013 at 22:47

Solved c optimization bit-manipulation bitflags

I have a byte I'm using for bitflags. I know that one and only one bit in the byte is set at any give time.

Ex: unsigned char b = 0x20; //(00100000) 6th most bit set

I currently use the following loop to determine which bit is set:

int getSetBitLocation(unsigned char b) {
  int i=0;
  while( !((b >> i++) & 0x01) ) { ; }
  return i;
}

How do I most efficiently determine the position of the set bit? Can I do this without iteration?

Tiros answered 20/1, 2013 at 21:41 Comment(10)

Assuming an 8-bit byte, the use of a lookup table might help, unless you can make use of a 'count leading zeroes' primitive / instruction. – Adigun 20/1, 2013 at 21:42

"Can I do this without iteration?" Use a lookup table or a switch statement. – Mccowyn 20/1, 2013 at 21:44

I was hoping for a bit-twiddling hack, not an obvious switch statement – Tiros 20/1, 2013 at 21:46

(Make sure to run a performance analysis over real input in the real execution/application context; specific compiler code generation, branch prediction, caching, and other factors come into play and can entirely mitigate any "performance increases". Also, it often Just Doesn't Matter - 97/3.) – Briolette 20/1, 2013 at 21:47

You can get to three comparisons and some math. Will it be actually faster than this loop? Hard to tell. – Course 20/1, 2013 at 21:48

@JanDvorak What are the three comparisons/math? – Tiros 20/1, 2013 at 21:49

What architecture does this need to run on? – Chemotropism 20/1, 2013 at 21:50

If this is your bottleneck, then you won't beat lookup table. I'm guessing that you've not actually done any profiling and are optimising without the benefit of that. – Mccowyn 20/1, 2013 at 21:57

"most efficiently" would be with your compilers built-in function for this, if there is one. Compilers are unlikely to detect what all the crazy tricks mentions in the answers are actually trying to do and convert them to something more sensible. Use those tricks at your own risk. – Jessie 21/1, 2013 at 8:18

Here's the most pure "bit-twiddling hack" that you've asked among all posted answers. It doesn't use lookups or switch/if/else branching. For performance, consider solutions that process more than one byte at a time e.g., similar to solutions for xor operation – Rumanian 22/1, 2013 at 3:29

Can I do this without iteration?

It is indeed possible.

How do I most efficiently determine the position of the set bit?

You can try this algorithm. It splits the char in half to search for the top bit, shifting to the low half each time:

int getTopSetBit(unsigned char b) {
  int res = 0;
  if(b>15){
    b = b >> 4;
    res = res + 4;
  }
  if(b>3){
    b = b >> 2;
    res = res + 2;
  }

  //thanks @JasonD
  return res + (b>>1);
}

It uses two comparisons (three for uint16s, four for uint32s...). and it might be faster than your loop. It is definitely not shorter.

Based on the idea by Anton Kovalenko (hashed lookup) and the comment by 6502 (division is slow), I also suggest this implementation (8-bit => 3-bit hash using a de-Bruijn sequence)

int[] lookup = {7, 0, 5, 1, 6, 4, 3, 2};

int getBitPosition(unsigned char b) {
  // return lookup[(b | (b>>1) | (b>>2) | (b>>4)) & 0x7];
  return lookup[((b * 0x1D) >> 4) & 0x7];
}

or (larger LUT, but uses just three terms instead of four)

int[] lookup = {0xFF, 0, 1, 4, 2, 0xFF, 5, 0xFF, 7, 3, 0xFF, 0xFF, 6, 0xFF, 0xFF, 0xFF};

int getBitPosition(unsigned char b) {
  return lookup[(b | (b>>3) | (b>>4)) & 0xF];
}

Course answered 20/1, 2013 at 21:54 Comment(8)

You can remove the last if() and just add b-1 – Chemotropism 20/1, 2013 at 21:57

@Chemotropism not b-1, but b>>1. b could be 2 or 3. – Course 20/1, 2013 at 21:59

@JanDvorak clever, this might be more efficient – Tiros 20/1, 2013 at 22:0

@JanDvorak He said only 1 bit was set, so it should only be 1 or 2. Your alternative is more general though. – Chemotropism 20/1, 2013 at 22:1

@Chemotropism you are right; b-1 would do as well in the restricted case (and it would be better if you want to return -1 if no bit is set) – Course 20/1, 2013 at 22:3

@JanDvorak Your hashed lookup-table lookup[((b * 0x1D) >> 4) & 0x7]; passed my 8 test-cases for bytes { 0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80 } when checking for the location of the only set bit. Very elegant! Might I ask where you got the hash from? Could you link to the De-Brujin sequences? – Tiros 20/1, 2013 at 23:51

@awashburn I was looking for a set of rotations that, when summed, would not produce collisions in the bottom three or four bits. This turns out to be isomorphic to finding a cyclic sequence of eight bits that have no duplicate trigrams - which (all n-grams are unique) is exactly the requirement for a de-Brujin sequence. Finding one consisted of choosing for each rotation to be active or not. It then turned out to be possible to replace rotations by shifts (if a bit becomes interesting, it's due to a left shift). Cf. en.wikipedia.org/wiki/De_Bruijn_sequence – Course 21/1, 2013 at 0:4

Cool trick! I had always used something like lut[foo % 67] to turn foo into log2(foo) when foo is a power of two. – Paddie 21/1, 2013 at 2:52

Lookup table is simple enough, and you can reduce its size if the set of values is sparse. Let's try with 11 elements instead of 128:

unsigned char expt2mod11_bits[11]={0xFF,0,1,0xFF,2,4,0xFF,7,3,6,5};
unsigned char pos = expt2mod11_bits[b%11];
assert(pos < 8);
assert(1<<pos == b);

Of course, it's not necessarily more effective, especially for 8 bits, but the same trick can be used for larger sizes, where full lookup table would be awfully big. Let's see:

unsigned int w; 
....
unsigned char expt2mod19_bits[19]={0xFF,0,1,13,2,0xFF,14,6,3,8,0xFF,12,15,5,7,11,4,10,9};
unsigned char pos = expt2mod19_bits[w%19];
assert(pos < 16);
assert(1<<pos == w);

Unabridged answered 20/1, 2013 at 22:1 Comment(15)

Nice idea for larger sizes... but for 8 bit I suppose that the modulo operation is going to be a problem. – Fluorspar 20/1, 2013 at 22:8

Interesting -- did you discover 11 by trial and error, or is there some principle involved? – Theodosia 20/1, 2013 at 22:9

@Fluorspar why do you think it's a problem? You just need a module that avoids collisions. – Course 20/1, 2013 at 22:9

@JanDvorak: because it's going to be slow – Fluorspar 20/1, 2013 at 22:10

@Fluorspar mod is one instruction on x86. I predict this will take two clock cycles at most (modulo and lookup). – Course 20/1, 2013 at 22:11

On x86, I would try inline assembly with BSF. – Unabridged 20/1, 2013 at 22:15

@AntonKovalenko except mod-11 is platform-agnostic while BSF isn't. – Course 20/1, 2013 at 22:19

@JanDvorak: you're a bit too optimistic. Division has always been annoyingly slow... so slow that when designing the first Pentium processor they even tried to cut some corners ;-). If you need the quotient the division can be traded for a multiplication in many cases, but if you need the remainder that trick doesn't work. And not even integer multiplication IIRC is 2 cycles on any processor I used. – Fluorspar 20/1, 2013 at 22:22

@Fluorspar then binary search might be the fastest solution. I'm wondering if it's possible to create a faster hash function than mod 11. – Course 20/1, 2013 at 22:24

Division by a constant can be optimised by modern compilers to remove the division. Try gcc -S -O -x c - -o - <<< "int f(int x) { return x % 11; }" – Boak 20/1, 2013 at 22:30

@JanDvorak It doesn't fit in the comment area, but it starts with a multiplication by 780903145. Note that that number is exactly 0x200000003 / 11. – Boak 20/1, 2013 at 22:33

@JanDvorak: if that computation is the bottleneck then it has to be done an awful number of times. In that case I think that using some L1 cache is not going to be a problem and a lookup in a static 256-entries table is just one operation. Only a real check with real data can however provide an answer. Also I think that it could make sense for example process two or four bytes at a time but more context is needed... – Fluorspar 20/1, 2013 at 22:43

@hvd would a hash function based on two additions and two bitshifts be faster than a hash function based on a single modulo? – Course 20/1, 2013 at 22:51

@6502: it still wastes a cache slot. Which could be used for different parts (the outer code) of the program. – Tartlet 20/1, 2013 at 22:52

I added a solution that uses a different hash than %11 to my answer. It should be faster than this one. – Course 20/1, 2013 at 23:0

This is a quite common problem for chess programs that use 64 bits to represent positions (i.e. one 64-bit number to store where are all the white pawns, another for where are all the black ones and so on).

With this representation there is sometimes the need to find the index 0...63 of the first or last set bit and there are several possible approaches:

Just doing a loop like you did
Using a dichotomic search (i.e. if x & 0x00000000ffffffffULL is zero there's no need to check low 32 bits)
Using special instruction if available on the processor (e.g. bsf and bsr on x86)
Using lookup tables (of course not for the whole 64-bit value, but for 8 or 16 bits)

What is faster however really depends on your hardware and on real use cases. For 8 bits only and a modern processor I think that probably a lookup table with 256 entries is the best choice...

But are you really sure this is the bottleneck of your algorithm?

Fluorspar answered 20/1, 2013 at 22:5 Comment(3)

...FULL means "unsigned long long"? – Course 20/1, 2013 at 22:6

ULL is the suffix, the F is just last hexadecimal digit, I'll edit to make it more clear – Fluorspar 20/1, 2013 at 22:11

I meant, is it actually compilable, or just a placeholder/elliplis :-) – Course 20/1, 2013 at 22:13

unsigned getSetBitLocation(unsigned char b) {
  unsigned pos=0;
  pos = (b & 0xf0) ? 4 : 0; b |= b >>4;
  pos += (b & 0xc) ? 2 : 0; b |= b >>2;
  pos += (b & 0x2) ? 1 : 0; 
  return pos; 
}

It would be hard to do it jumpfree. Maybe with the Bruin sequences ?

Tartlet answered 20/1, 2013 at 21:55 Comment(0)

Based on log2 calculation in Find the log base 2 of an N-bit integer in O(lg(N)) operations:

int getSetBitLocation(unsigned char c) {
  // c is in {1, 2, 4, 8, 16, 32, 64, 128}, returned values are {0, 1, ..., 7}
  return (((c & 0xAA) != 0) |
          (((c & 0xCC) != 0) << 1) |
          (((c & 0xF0) != 0) << 2));
}

Rumanian answered 20/1, 2013 at 22:47 Comment(7)

This relies on the fact that != returns 0 or 1. Is that guaranteed by the spec? I also guess it doesn't really avoid branching due to how != is implemented. – Course 20/1, 2013 at 22:49

@JanDvorak: yes, it was added in c99. (I used the ternary to be downward compatible with c89/c90) And it indeeds looks not jumpfree. (but there are some strange instructions nowadays) – Tartlet 20/1, 2013 at 22:54

@JanDvorak: yes, it is guaranteed even before c99. Also I don't see any branching in the code. You could see its assembly online. Could you point out the branching instructions? – Rumanian 22/1, 2013 at 2:52

@J.F.Sebastian Without optimisation, wrapped in int main(char c), the three !=s turn to set if not equal on line 0x14 (impressive), 0x25: je 0x2E / 0x2C: jmp 0x33 (ouch, a branching instruction), and the same combo appears at 0x40. With optimisations turned on, the jes turn to subtract with borrows (impressive) ARM uses a move if not equal / move if equals combo (nice instructions). Thanks for the link to the compiler. – Course 22/1, 2013 at 8:15

Note that by deBrujin-based solution turns to mere five operations: MOVZB, IMUL, SAR, AND, MOV – Course 22/1, 2013 at 8:35

@JanDvorak: you can paste the code as is. The online service accepts the function without naming it main(). There is no point worrying about branching if you disable optimizations in the compiler. De Bruijn sequences look interesting. – Rumanian 22/1, 2013 at 9:34

I was surprised it didn't mung my IMUL to a series of additions. – Course 22/1, 2013 at 9:37

Easiest thing is to create a lookup table. The simplest one will be sparse (having 256 elements) but it would technically avoid iteration.

This comment here technically avoids iteration, but who are we kidding, it is still doing the same number of checks: How to write log base(2) in c/c++

Closed form would be log2(), a la, log2() + 1 But I'm not sure how efficient that is - possibly the CPU has an instruction for taking base 2 logrithms?

Ilianailine answered 20/1, 2013 at 21:50 Comment(4)

FYL2X instruction v/s bit-shifts? Not sure there. – Mun 20/1, 2013 at 21:57

The memory requirements of a sparse look-up table with 255 elements do not outweigh the minor performance optimization. – Tiros 20/1, 2013 at 22:1

How many elements will a sparse lookup table have when CHAR_BIT == 16, or when CHAR_BIT == 32? – Vicariate 20/1, 2013 at 22:3

“possibly the CPU has an instruction for taking base 2 logrithms?” There is an instruction to count leading zeroes in most processors. – Virgil 21/1, 2013 at 2:45

if you define

const char bytes[]={1,2,4,8,16,32,64,128}

and use

struct byte{
char data;
int pos;
}
void assign(struct byte b,int i){

b.data=bytes[i];
b.pos=i
}

you don't need to determine the position of the set bit

Tabes answered 20/1, 2013 at 21:55 Comment(2)

you don't need to determine the position of the set bit -- I'd like to dispute – Course 20/1, 2013 at 21:56

He's fighting the OPPOSITE problem... given the value finding the index. Scanning your table is more or less like just checking the bits. – Fluorspar 20/1, 2013 at 22:4

A lookup table is fast and easy when CHAR_BIT == 8, but on some systems, CHAR_BIT == 16 or 32 and a lookup table becomes insanely bulky. If you're considering a lookup table, I'd suggest wrapping it; make it a "lookup table function", instead, so that you can swap the logic when you need to optimise.

Using divide and conquer, by performing a binary search on a sorted array, involves comparisons based on log2 CHAR_BIT. That code is more complex, involving an initialisation of an array of unsigned char to use as a lookup table for a start. Once you have such the array initialised, you can use bsearch to search it, for example:

#include <stdio.h>
#include <stdlib.h>
void uchar_bit_init(unsigned char *table) {
    for (size_t x = 0; x < CHAR_BIT; x++) {
        table[x] = 1U << x;
    }
}
int uchar_compare(void const *x, void const *y) {
    char const *X = x, *Y = y;
    return (*X > *Y) - (*X < *Y);
}
size_t uchar_bit_lookup(unsigned char *table, unsigned char value) {
    unsigned char *position = bsearch(lookup, c, sizeof lookup, 1, char_compare);
    return position ? position - table + 1 : 0;
}
int main(void) {
    unsigned char lookup[CHAR_BIT];
    uchar_bit_init(lookup);
    for (;;) {
        int c = getchar();
        if (c == EOF) { break; }
        printf("Bit for %c found at %zu\n", c, uchar_bit_lookup(lookup, c));
    }
}

P.S. This sounds like micro-optimisation. Get your solution done (abstracting the operations required into these functions), then worry about optimisations based on your profiling. Make sure your profiling targets the system that your solution will run on if you're going to focus on micro-optimisations, because the efficiency of micro-optimisations differ widely as hardware differs even slightly... It's usually a better idea to buy a faster PC ;)

Vicariate answered 20/1, 2013 at 21:59 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags