Optimization: Expensive branching vs cheap comparison for counting base10 digits (ilog10)
This is a great article that discusses low-level optimization techniques and shows an example where the author converts expensive divisions into cheap comparisons. https://www.facebook.com/notes/facebook-engineering/three-optimization-tips-for-c/10151361643253920

For those who don't want to click, essentially he converted this:

uint32_t digits10(uint64_t v) {
    uint32_t result = 0;
    do {
        ++result;
        v /= 10;
    } while (v);
    return result;
}

Into this:

uint32_t digits10(uint64_t v) {
  uint32_t result = 1;
  for (;;) {
    if (v < 10) return result;
    if (v < 100) return result + 1;
    if (v < 1000) return result + 2;
    if (v < 10000) return result + 3;
    // Skip ahead by 4 orders of magnitude
    v /= 10000U;
    result += 4;
  }
}

Resulting in up to a 6× speedup.

While comparisons are very cheap, I've always heard that branches are very expensive because they can cause pipeline stalls. Because of the conventional wisdom about branching, I never would have considered an approach like this.

Why is branching not a bottleneck in this case? Is it because we return right after each of the comparisons? Is it because the code size here is small and thus there is not too much for the processor to mispredict? In what cases would branching become a bottleneck and start to dominate the cost of the divisions? The author never speaks about this.

Can anyone resolve the apparent tension between cheap comparisons and expensive branches? Of course the golden rule of optimization is that one must always measure. However, it would at least be good to have some intuition about this issue, so that one could use comparisons intelligently when trying to come up with new approaches to making code faster.


Related: performance of log10 function returning an int has fast code for this problem, including a fast strategy that adds a 0/1 compare result, instead of branching on it, to fix up a table lookup based on a clz ilog2 result. But the answers there don't discuss that distinction, or the general issue of branch mispredictions when this looping function is called on numbers with unpredictable, randomly distributed magnitudes.
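
For reference, here is a sketch of that table-plus-fixup idea, reconstructed from the linked answer's description rather than quoted from it (the function name is mine). The (bits * 1233) >> 12 trick approximates bits * log10(2), and the final comparison adds a 0/1 fixup without branching:

#include <bit>
#include <cstdint>

uint32_t digits10_clz(uint64_t v) {
    static const uint64_t pow10[20] = {
        1ULL, 10ULL, 100ULL, 1000ULL, 10000ULL, 100000ULL,
        1000000ULL, 10000000ULL, 100000000ULL, 1000000000ULL,
        10000000000ULL, 100000000000ULL, 1000000000000ULL,
        10000000000000ULL, 100000000000000ULL, 1000000000000000ULL,
        10000000000000000ULL, 100000000000000000ULL,
        1000000000000000000ULL, 10000000000000000000ULL
    };
    if (v == 0) return 1;                      // "0" still prints one digit
    uint32_t bits = 64 - std::countl_zero(v);  // ilog2(v) + 1
    uint32_t approx = (bits * 1233) >> 12;     // ~ floor(bits * log10(2))
    return approx + (v >= pow10[approx]);      // add the 0/1 compare result
}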

Muncy answered 7/8, 2013 at 0:57 Comment(5)
Erm. It is reducing branches. if is a branch, but while also has a branch, and there are 4× fewer of those now. In the simplest case it just reorders branches and reduces div/increment ops. In realistic scenarios (with branch prediction) it allows the pipeline to remain filled, because the conditions don't actually branch, whereas the while always branches. – Mellissamellitz
What exactly do you mean by "the conditions don't actually branch"? if (v < 10) sure looks like a branch to me. – Muncy
Depending on the generated assembly, one of the "branches" won't actually branch (EIP will just be incremented as if there were a noop). – Mellissamellitz
bit.ly/17wg3WT Doesn't look like any of the branches are optimized away on gcc or clang. I believe they used gcc 4.7 at Facebook. – Muncy
Division is the most expensive instruction of all. A pipeline stall is not as expensive. – Gemmation
Branches aren't necessarily expensive -- it's really mispredicted branches that are expensive.¹

So, let's start with the loop. It's infinite, so it's always taken. Since it's always taken, it's also always predicted as taken, so it's cheap.

Only one other branch is ever taken for any given input. That is to say, you do one test after another, and until you reach the one that matches the magnitude of the input number all the branches are not taken (i.e., the if condition will be false).

Assuming (for example) a random mix of input numbers with a maximum of, say, 16 digits, one of the four if branches is taken only on the final iteration of the loop: on average, roughly one taken branch per 16 tests. A decent branch predictor will probably predict them all as not-taken nearly all the time, so we probably end up with exactly one mis-predicted branch in the entire computation.

As a rule of thumb, figure that a correctly predicted branch takes around 1 clock, and a mis-predicted branch takes around 20-30 clocks. So, for a 16-digit number we end up with something like 15 digits + 4 loop iterations = 19 correctly predicted branches + 1 mis-predicted branch, for a total of something like 39-49 clocks. For, say, a 2-digit number, we end up at around 1 + 20 = 21 clocks.

The obvious alternative would be to divide by 10 and check the quotient every iteration. Division is relatively expensive -- for example, a 64-bit division can take around 26-86 cycles on an i7. For the sake of simplicity, let's assume an average of 40. So, for a 16-digit number we can expect around 16*40 = ~640 clocks for the divisions alone. Even in the best case, a 2-digit number that needs only 26 clocks per division still costs 52 clocks total for its two divisions.
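
As the question says, the golden rule is to measure, so here is a rough, self-contained harness sketch for timing the two versions from the question (the _div/_cmp names are mine, and the input distribution, fixed seed, and lack of warm-up are all simplifying assumptions):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

static uint32_t digits10_div(uint64_t v) {     // the divide-by-10 loop
    uint32_t result = 0;
    do { ++result; v /= 10; } while (v);
    return result;
}

static uint32_t digits10_cmp(uint64_t v) {     // the comparison version
    uint32_t result = 1;
    for (;;) {
        if (v < 10) return result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        v /= 10000U;
        result += 4;
    }
}

int main() {
    // A random shift gives a mix of magnitudes, roughly matching the
    // "unpredictable sizes" scenario discussed above.
    std::mt19937_64 rng(42);
    std::vector<uint64_t> inputs(1'000'000);
    for (auto& x : inputs) x = rng() >> (rng() & 63);

    for (auto f : {digits10_div, digits10_cmp}) {
        uint64_t sink = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (uint64_t x : inputs) sink += f(x);
        auto t1 = std::chrono::steady_clock::now();
        long long ns = std::chrono::duration_cast<
            std::chrono::nanoseconds>(t1 - t0).count();
        std::printf("%lld ns total (checksum %llu)\n", ns,
                    (unsigned long long)sink);
    }
}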

Alternative

Depending on exactly why you need to do this, you may be able to use an approximation that's a lot faster. For example, if you're counting digits so you can allocate enough memory to hold the result, you can count the number of octal (base 8) digits much faster. This will give you a result that's a little too large part of the time, so you allocate an extra byte once in a while.

We can also cheat a tiny bit. The decimal representation of a 64-bit number never exceeds 20 digits, and values of 2^60 and up always need 19 or 20 of them, so we only really have to scan 20*3 = 60 bits of the input and can report the maximum for anything larger.

But that approximation can be really fast. For example:

uint32_t digits = 0;

if (input >> 60) {       // 2^60 and up: always 19 or 20 decimal digits,
    digits = 20;         // so just report the maximum
} else {
    while (input != 0) {
        ++digits;
        input >>= 3;     // strip one octal digit per iteration
    }
}

This has a maximum of 20 iterations--but since (as noted in the linked article) "most numbers are small", we expect it to execute far fewer than that most of the time. The body of the loop uses only extremely elementary operations, so even in the worst case we'd expect only about 4 clocks per iteration (and that would be on something like a Pentium 4, whose shifter is slow enough that a 3-bit shift may cost several successive shift operations). For a recent processor, even the worst case is probably around the speed of a single divide instruction (and we'd generally expect a lot less).
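
As a comment below points out, on a CPU with a fast leading-zero-count instruction, a bit-scan can replace the loop entirely. A minimal sketch, assuming C++20's <bit> header:

#include <bit>
#include <cstdint>

uint32_t octal_digits(uint64_t v) {
    if (v == 0) return 1;                      // "0" is still one digit
    uint32_t bits = 64 - std::countl_zero(v);  // index of top set bit + 1
    return (bits + 2) / 3;                     // round up to whole octal digits
}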

If we did expect a lot of the numbers to be large, we could use a (sort of) binary search instead. For example, let's consider a 32-bit number. We'll start with the same cheat as above, but note that the largest 32-bit decimal number needs 10 digits: values of 2^27 and up always need 9 or 10 of them, so we can report the maximum for those and search only the low 27 bits.

So let's start by separating those 27 bits into 15 bits in the upper "half" and 12 bits in the lower half:

// Values of 2^27 and up always need 9 or 10 decimal digits; handle them
// up front (e.g., return 10 immediately) so the masking below can't
// under-count. Then:
input &= 0777777777; // keep only the bottom 27 bits

if (input & 0777770000) {  // any of the upper 15 bits set?
    digits += 4;           // shifting off 12 bits removes 4 octal digits
    input >>= 12;
}

Then we basically just repeat that pattern on the remaining 15 bits, and so on down to 3 bits:

if (input & 077700) {   // bits 6-14: at least 3 octal digits remain
    digits += 2;
    input >>= 6;
}
if (input & 0770) {     // bits 3-8: at least 2 octal digits remain
    ++digits;
    input >>= 3;
}
if (input & 070) {      // bits 3-5: still another digit to strip
    input >>= 3;
    ++digits;
}
if (input != 0) {       // whatever is left is the final digit
    ++digits;
}
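
Collected into one self-contained function (a sketch of the same idea; the early return implements the saturation for values of 2^27 and up mentioned above, and zero is treated as one digit):

#include <cstdint>

// Over-approximates the decimal digit count by counting octal digits of
// the low 27 bits; may report one digit too many, which is fine for
// sizing a buffer.
uint32_t digits10_approx(uint32_t input) {
    if (input >> 27) return 10;   // 2^27 and up: 9 or 10 decimal digits
    uint32_t digits = 0;
    if (input & 0777770000) { digits += 4; input >>= 12; }
    if (input & 077700)     { digits += 2; input >>= 6;  }
    if (input & 0770)       { digits += 1; input >>= 3;  }
    if (input & 070)        { digits += 1; input >>= 3;  }
    if (input != 0)         { digits += 1; }
    return digits ? digits : 1;   // "0" still takes one character
}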

As above, this uses only really simple operations, and the two instructions in the body of each if statement can execute in parallel on a superscalar processor. If we have predicated instructions (e.g., on ARM), we can also avoid any mis-predicted branches. As a wild guess, the entire sequence runs on the same general order of time as a single div instruction.

In fact, at this point speed will likely be limited primarily by the rate at which we can read the data from memory, so if we wanted to optimize further, we'd probably focus on improving the cache hit rate.

But as noted up-front: that does depend on being able to use an approximation that might be slightly wrong part of the time. If we can't tolerate that, it's not an option.

Summary

Even in very close to the best case for it, division still ends up slower than nearly the worst case for comparisons. Most of the comparisons end up predicted correctly, so we typically end up with only one expensive (mis-predicted) branch.

If a reasonable approximation (for some definition of "reasonable") is adequate, it's pretty easy to improve speed quite a lot more.


¹ This, of course, is assuming a modern, relatively high-end processor. On a really old processor (or a low-end embedded processor) you probably don't have a branch predictor at all, so all branches tend to be quite expensive. At the same time, such a processor may not have a divide instruction at all, and if it does, it's probably quite slow. In short, both branches and division take a lot more clocks than on a modern processor, but a branch is still usually quite a bit faster than a division.
Atory answered 7/8, 2013 at 2:32 Comment(6)
Division by a constant can be optimized into multiplication by its multiplicative inverse, so it's a lot cheaper than a general division. However, multiplication is still more expensive than comparison and addition, so reducing the number of multiplications may also result in faster execution. – Tonguetied
@phuclv: And you can loop the other way, starting with n=1 and doing n*=1000, with n*10 and n*100 inside the loop (e.g. to optimize for a machine with a pipelined multiplier). Also to avoid division entirely, although as you say that's not a big deal for a constant with modern compilers. Hrm, also possibly introducing overflow. Maybe divide once (by 10 or 100) to have two things to compare against after every n *= 100, but you'd have to also check for n < old_n to detect unsigned wrapping of the value, if you didn't check the largest bracket outside the loop. – Avril
If you're counting octal digits, you could use (32 - std::countl_zero(x)) / 3 or something with numeric_limits, but maybe rounding up instead of down. Anyway, a bit-scan instead of a loop is the way to go on modern CPUs that have a fast instruction for this (RISC-V without extension B being one of the exceptions). Of course that makes it branchless without even trying, which is likely a good thing for performance. – Avril
Another technique is using countl_zero to index a table of log10 results, with each entry having a power of 10 to compare against to account for an off-by-1, since some values with the same leading-bit position can be just above or below a power of 10. E.g. Stephen Cannon's answer on performance of log10 function returning an int. – Avril
@PeterCordes: The good point of counting octal digits is that in most cases, it just won't matter. Practically any implementation is going to be fast enough that it's no longer a bottleneck. But also be careful with (32 - std::countl_zero(x)) / 3 -- with poor optimization, that division by 3 could end up slower than the longer sequences I gave. – Atory
@JerryCoffin: I'm of course assuming a compiler will optimize it to a multiplicative inverse since it's a constant. In a special case of a particularly dumb compiler, or a machine without a usably fast single-byte multiply, you might do something else. I'm also assuming that countl_zero will compile to something nice, like 1 or 2 instructions. If it was going to use a bithack without special CPU instruction support, yeah, you might as well work in groups of 3 bits yourself instead of getting an exact bit-scan result. – Avril
The first implementation actually branches more, even though it has only one branch point.

Although, just as a matter of preference in coding style, I would use the first implementation. A collection of similar branches might well perform better, but it's still more code, and it looks like it was written thoughtlessly (as a matter of fact, why does it retain result across iterations?). And what if I want more than five digits? :|

Calabar answered 7/8, 2013 at 3:30 Comment(0)
Given the nature of the function, you could eliminate most branches as follows:

uint32_t digits10(uint64_t v) {
  uint32_t result = 1;
  do {
    // Each comparison contributes 0 or 1, with no branch needed.
    result += (v >= 10) + (v >= 100) + (v >= 1000) + (v >= 10000);
    if (v < 10000) return result;
    v /= 10000U;
  } while (true);
}

In the time it takes to do the division, instruction-level parallelism could potentially do the other mathematical comparisons and update the result. Since the division doesn't depend upon the result, the branch predictor could race ahead to the next iteration of the loop. And since comparison operations have similar speed to subtraction/addition, and the CPU can do 4 of those operations per cycle, you may be able to extend it further.

uint32_t digits10(uint64_t v) {
  uint32_t result = 1;
  do {
    result += (v >= 10) + (v >= 100) + (v >= 1000) + (v >= 10000)
            + (v >= 100000) + (v >= 1000000) + (v >= 10000000)
            + (v >= 100000000);
    if (v < 100000000) return result;
    v /= 100000000U;
  } while (true);
}

If smaller values are more common, then a short-circuit result can avoid the division and may be more likely to be predicted. You would have to measure to find the right balance.
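
For instance, here is a sketch of that trade-off (the 4-digit cutoff and the function name are assumptions; tune the cutoff to your input distribution):

#include <cstdint>

uint32_t digits10_small_first(uint64_t v) {
  if (v < 10000) {   // well-predicted early out if small numbers dominate
    return 1u + (v >= 10) + (v >= 100) + (v >= 1000);
  }
  uint32_t result = 5;   // at least 5 digits from here on
  v /= 10000U;
  for (;;) {
    result += (v >= 10) + (v >= 100) + (v >= 1000);
    if (v < 10000) return result;
    v /= 10000U;
    ++result;            // with the three compares above: 4 digits per step
  }
}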

Assizes answered 10/1 at 19:9 Comment(4)
If you're talking about x86-64, it takes 2 or 3 instructions to materialize a compare result in an integer register where you can add it. (xor eax,eax / cmp ... / setb al. Reusing the same zeroed register whose high 56 bits are zero would create an output dependency for later compares, but save an xor-zeroing uop. Also, with unsigned compares, the carry flag has the result which allows cmp / adc rax, 0 to add CF to another register)Avril
I can't tell if you are suggesting a way to solve the dependency or to show how it is unsolvable. There may also be a way to use SIMD to do the compares in a single instruction.Assizes
I'm saying that (v<10) + (v<100) + ... isn't quite as cheap for the compiler to implement as you're suggesting, except on MIPS / RISC-V where compare instructions produce a 0 / 1 in an integer register already. And maybe on more CPUs if the compiler is clever about using the carry flag and add-with-carry. x86-64 GCC isn't, clang is. godbolt.org/z/9csETcsqq Clang also auto-vectorizes the loop after the first 4 iterations, broadcasting the divide (mulx/shr) result for a packed compare / movemask / popcount.Avril
So that's interesting, but needs a vector constant; if you don't mind loading static data, the tables for lzcnt + a fixup (performance of log10 function returning an int) also fit in a cache line (or maybe two) and that's constant-time branchless.Avril
The algorithm is mostly comparisons. The only explicit branch is when returning.

The gain is mostly from avoiding a costly division per digit, which could take over 100 clock cycles each. One could make a case that since the maximum uint64_t value has 20 decimal digits, unrolling the loop into a chain of 19 comparisons would have been the most efficient way.
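
A sketch of what that fully unrolled version would look like, with one comparison per digit boundary and no division at all (the function name is mine):

#include <cstdint>

uint32_t digits10_unrolled(uint64_t v) {
    // 2^64 - 1 has 20 decimal digits, so 19 comparisons cover every case.
    if (v < 10ULL) return 1;
    if (v < 100ULL) return 2;
    if (v < 1000ULL) return 3;
    if (v < 10000ULL) return 4;
    if (v < 100000ULL) return 5;
    if (v < 1000000ULL) return 6;
    if (v < 10000000ULL) return 7;
    if (v < 100000000ULL) return 8;
    if (v < 1000000000ULL) return 9;
    if (v < 10000000000ULL) return 10;
    if (v < 100000000000ULL) return 11;
    if (v < 1000000000000ULL) return 12;
    if (v < 10000000000000ULL) return 13;
    if (v < 100000000000000ULL) return 14;
    if (v < 1000000000000000ULL) return 15;
    if (v < 10000000000000000ULL) return 16;
    if (v < 100000000000000000ULL) return 17;
    if (v < 1000000000000000000ULL) return 18;
    if (v < 10000000000000000000ULL) return 19;
    return 20;
}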

Suiting answered 7/8, 2013 at 1:12 Comment(5)
"The only explicit branch is when returning." Are you suggesting then that you only pay for branches when they are actually taken, not when they aren't?Muncy
You always pay for the condition evaluation. However, the instruction pipeline doesn't need to be flushed unless the branch is taken. (Note that modern CPUs can do branch prediction which might complicate the situation a bit, but the basics still stand: the pipeline either is flushed or remains saturated)Mellissamellitz
If the author is to be believed, paying for the condition is cheap, almost free even so that's no problem. So just so we understand everything. The actual performance problem with branching we are trying to avoid is flushing the pipeline. Is flushing the only potential slowdown or are there other consequences of branching?Muncy
Cache locality, in general. I don't see that being a factor here. In theory false sharing could harm other threads, but the mutating variables are thread local (on the stack) so it doesn't apply here.Mellissamellitz
Integer division by a constant is transformed into multiplication by optimizing compilers (ref) so it's not quite so bad as "100 clock cycles each."Gaultheria
