Branches aren't necessarily expensive -- it's really mispredicted branches that are expensive¹.
So, let's start with the loop. It's infinite, so it's always taken. Since it's always taken, it's also always predicted as taken, so it's cheap.
Only one other branch is ever taken for any given input. That is to say, you do one test after another, and until you reach the one that matches the magnitude of the input number, none of the branches is taken (i.e., the if condition is false).
Assuming (for example) a random mix of input numbers with a maximum of, say, 16 digits, one of the four branches is taken in only one of the (at most) four iterations of the loop. In other words, only about one test in 16 is taken on average, and a decent branch predictor will probably predict them all as not-taken nearly all the time. The result is that we probably end up with exactly one mis-predicted branch in the entire computation.
As a rule of thumb, figure that a correctly predicted branch takes around 1 clock, and a mis-predicted branch takes around 20-30 clocks. So, for a 16-digit number we end up with something like 15 digit tests + 4 loop iterations = 19 correctly predicted branches plus 1 mis-predicted branch, for a total of something like 39-49 clocks. For, say, a 2-digit number, we end up at around 1 + 20 = 21 clocks.
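The code under discussion isn't reproduced here, but the analysis above fits a counter along these lines (a sketch of my own; the name and exact thresholds are illustrative, and the extra test at 17 digits just keeps the limit from overflowing):

#include <stdint.h>

unsigned count_digits(uint64_t n) {
    unsigned digits = 1;
    uint64_t limit = 10;
    for (;;) {                              // always-taken (and correctly predicted) loop branch
        if (n < limit)        return digits;      // four tests per iteration,
        if (n < limit * 10)   return digits + 1;  // at most one of them ever taken
        if (n < limit * 100)  return digits + 2;
        if (digits == 17)     return 20;          // n >= 10^19: 20 digits; also avoids overflowing limit
        if (n < limit * 1000) return digits + 3;
        limit *= 10000;                     // skip ahead four digits per iteration
        digits += 4;
    }
}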
The obvious alternative would be to divide by 10 every iteration until the quotient reaches zero. Division is relatively expensive -- for example, a 64-bit division can take around 26-86 cycles on an i7. For the sake of simplicity, let's assume an average of 40. So, for a 16-digit number we can expect around 16*40 = ~640 clocks for the divisions alone. Even in the best case -- say, a 2-digit number where each division takes only 26 clocks -- we still end up at 52 clocks total.
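For comparison, the division-based version being costed above would look something like this (again just a sketch; the name is mine):

#include <stdint.h>

unsigned count_digits_div(uint64_t n) {
    unsigned digits = 0;
    do {                 // one 64-bit division per digit of the input
        n /= 10;
        ++digits;
    } while (n != 0);
    return digits;
}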
Alternative
Depending on exactly why you need to do this, you may be able to use an approximation that's a lot faster. For example, if you're counting digits so you can allocate enough memory to hold the result, you can count the number of octal (base 8) digits much faster. This will give you a result that's a little too large part of the time, so you allocate an extra byte or two once in a while.
We can also cheat a tiny bit. An unsigned 64-bit number never needs more than 20 decimal digits, and anything at or above 2^54 (about 1.8e16) needs at least 17 of them, so a single up-front test can simply answer 20 for those (over-allocating by at most three bytes). After that, we only have to look at the bottom 18*3 = 54 bits of the input.
But that approximation can be really fast. For example:
if (input >> 54) {     // at least 2^54, so 20 digits always suffices
    digits = 20;
    input = 0;         // nothing left to count
}
while (input != 0) {   // at most 18 iterations
    ++digits;
    input >>= 3;
}
This has a maximum of 18 iterations -- but since (as noted in the linked article) "most numbers are small", we expect it to execute far fewer than that most of the time. The body of the loop uses only extremely elementary operations, so even in the worst case we'd expect only about 4 clocks per iteration (and that would be on something like a Pentium 4, where shifts are unusually slow). On a recent processor, even the worst case is probably around the cost of a single divide instruction, and we'd generally expect a lot less.
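For the allocation use case mentioned earlier, the whole thing might get wrapped up something like this (a sketch; the function names are my own):

#include <stdint.h>
#include <stdlib.h>

// Upper bound on the number of decimal digits in v -- may overshoot by a byte or three.
static unsigned approx_digits64(uint64_t v) {
    unsigned digits = 0;
    if (v == 0)
        return 1;          // "0" still takes one digit
    if (v >> 54)
        return 20;         // at least 2^54: 20 digits always suffices
    while (v != 0) {       // at most 18 iterations
        ++digits;
        v >>= 3;
    }
    return digits;
}

// Allocate a buffer big enough for the decimal form of v, plus a terminator.
char *alloc_decimal_buffer(uint64_t v) {
    return malloc(approx_digits64(v) + 1);
}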
If we did expect a lot of the numbers to be large, we could use a (sort of) binary search instead. For example, let's consider a 32-bit number. We start with the same kind of cheat as above: a 32-bit value never needs more than 10 decimal digits, and anything at or above 2^27 (about 1.3e8) needs at least 9 of them, so one up-front test can simply answer 10 for those. After that, only the bottom 27 bits matter -- enough to produce a count of up to 9.
So let's handle the big values first, then split those 27 bits into 15 bits in the upper "half" and 12 bits in the lower half:
if (input >> 27) {            // at least 2^27: 10 digits always suffices for a 32-bit value
    digits += 10;
    input = 0;                // nothing left to count
}
if (input & 0777770000) {     // anything in the top 15 of the remaining 27 bits (bits 12-26)?
    digits += 5;
    input >>= 12;
}
Then we basically just repeat that pattern for the remaining 15 bits, and so on down to 3 bits:
if (input & 077700) {         // anything in bits 6-14?
    digits += 2;
    input >>= 6;
}
if (input & 0770) {           // anything in bits 3-8?
    ++digits;
    input >>= 3;
}
if (input & 070) {            // anything in bits 3-5?
    ++digits;
    input >>= 3;
}
if (input != 0) {             // anything at all left?
    ++digits;
}
As above, this uses only really simple operations; the two instructions in the body of each if statement can execute in parallel on a superscalar processor, and with predicated instructions (e.g., on ARM) we can avoid mis-predicted branches entirely. As a wild guess, the entire sequence is on the same general order as a single div instruction.
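To make that concrete on a target without predication, here's a branch-free formulation of the same cascade (a sketch; the name is mine), in which each condition becomes a 0-or-1 value used arithmetically, so there's nothing for the branch predictor to get wrong:

#include <stdint.h>

unsigned approx_digits32(uint32_t input) {
    unsigned digits = 0;
    unsigned t;

    t = (input >> 27) != 0;            // 1 if the value is at least 2^27
    digits += 10 * t;
    input &= t - 1u;                   // t==1: mask to 0 (nothing left); t==0: leave unchanged

    t = (input & 0777770000u) != 0;    // anything in bits 12-26?
    digits += 5 * t;
    input >>= 12 * t;

    t = (input & 077700u) != 0;        // anything in bits 6-14?
    digits += 2 * t;
    input >>= 6 * t;

    t = (input & 0770u) != 0;          // anything in bits 3-8?
    digits += t;
    input >>= 3 * t;

    t = (input & 070u) != 0;           // anything in bits 3-5?
    digits += t;
    input >>= 3 * t;

    digits += (input != 0);            // anything at all left?
    return digits;
}

Whether this actually beats the version with (highly predictable) branches depends on the compiler and the target, so it's worth measuring rather than assuming.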
In fact, at this point speed will likely be limited primarily by how fast we can read the data from memory, so if we wanted to optimize further, we'd probably want to focus on improving the cache hit rate.
But as noted up-front: that does depend on being able to use an approximation that might be slightly wrong part of the time. If we can't tolerate that, it's not an option.
Summary
Even very close to its best case, division still ends up slower than nearly the worst case for the comparisons. Most of the comparisons end up predicted correctly, so we typically end up with only one expensive (mis-predicted) branch.
If a reasonable approximation (for some definition of "reasonable") is adequate, it's pretty easy to improve speed quite a lot more.
¹ This, of course, is assuming a modern, relatively high-end processor. On a really old processor (or a low-end embedded processor) you probably don't have a branch predictor at all, so all branches tend to be quite expensive. At the same time, such a processor may not have a divide instruction at all, and if it does, it's probably quite slow. In short, both branches and division take a lot more clocks than on a modern processor, but a branch is still usually quite a bit faster than a division.