How can I speed up the binary GCD algorithm using __builtin_ctz?
Asked Answered
D

2

9

clang and GCC have a int __builtin_ctz(unsigned) function. This counts the trailing zeros in an integer. The Wikipedia article on this family of functions mentions that the binary GCD algorithm can be sped up using __builtin_ctz, but I don't understand how.

The sample implementation of the binary GCD looks like this:

unsigned int gcd(unsigned int u, unsigned int v)
{
    // simple cases (termination)
    if (u == v)
        return u;

    if (u == 0)
        return v;

    if (v == 0)
        return u;

    // look for factors of 2
    if (~u & 1) // u is even
        if (v & 1) // v is odd
            return gcd(u >> 1, v);
        else // both u and v are even
            return gcd(u >> 1, v >> 1) << 1;

    if (~v & 1) // u is odd, v is even
        return gcd(u, v >> 1);

    // reduce larger argument
    if (u > v)
        return gcd(u - v, v);

    return gcd(v - u, u);
}

My suspicion is that I could use __builtin_ctz as follows:

constexpr unsigned int gcd(unsigned int u, unsigned int v)
{
    // simplified first three ifs
    if (u == v || u == 0 || v == 0)
        return u | v;

    unsigned ushift = __builtin_ctz(u);
    u >>= ushift;

    unsigned vshift = __builtin_ctz(v);
    v >>= vshift;

    // Note sure if max is the right approach here.
    // In the if-else block you can see both arguments being rshifted
    // and the result being leftshifted only once.
    // I expected to recreate this behavior using max.
    unsigned maxshift = std::max(ushift, vshift);

    // The only case which was not handled in the if-else block before was
    // the odd/odd case.
    // We can detect this case using the maximum shift.
    if (maxshift != 0) {
        return gcd(u, v) << maxshift;
    }

    return (u > v) ? gcd(u - v, v) : gcd(v - u, u);
}

int main() {
    constexpr unsigned result = gcd(5, 3);
    return result;
}

Unfortunately this does not work yet. The program results in 4, when it should be 1. So what am I doing wrong? How can I use __builtin_ctz correctly here? See my code so far on GodBolt.

Dent answered 26/8, 2020 at 19:57 Comment(6)
The other sample of binary GCD there essentially has min where you have max, but it works a bit different overallStites
How does it compare to std::gcd? Is this supposed to be faster?Discourteous
@TedLyngmo not sure. I tried to benchmark it but my benchmark segfaulted. Note that the link is to Godbolt because quick-bench doesn't let your share failed benchmarks. Do you know what's wrong with this benchmark? It's not really supposed to be faster anyways, the question is rather about understanding how to use CTZ.Dent
It's a stackoverflow in your gcd. It goes into a very deep recursion @ u=3508125240, v=2952784951. Here are the max, min and shift values.Discourteous
@TedLyngmo thanks for clearing that up. My implementation is faster than std::gcd and BrettHale's is even faster than that. (see my answer for the benchmark results)Dent
@J.Schultke You are welcome. The implementation I tested was apparently broken so I couldn't get any numbers from it more in a couple of tests and then it was really slow, but as it was broken I didn't pay much attention to that. The tests I put in a comment under @BrettHale's answer was indeed much faster than std::gcd.Discourteous
M
5

Here's my iterative implementation from the comments:

While tail-recursive algorithms are often elegant, iterative implementations are almost always faster in practice. (Modern compilers can actually perform this transform in very simple cases.)

unsigned ugcd (unsigned u, unsigned v)
{
    unsigned t = u | v;

    if (u == 0 || v == 0)
        return t; /* return (v) or (u), resp. */

    int g = __builtin_ctz(t);

    while (u != 0)
    {
        u >>= __builtin_ctz(u);
        v >>= __builtin_ctz(v);

        if (u >= v)
            u = (u - v) / 2;
        else
            v = (v - u) / 2;
    }

    return (v << g); /* scale by common factor. */
}

As mentioned, the |u - v| / 2 step is typically implemented as a very efficient, unconditional right shift, e.g., shr r32, to divide by (2) - as both (u), (v) are odd, and therefore |u - v| must be even.

It's not strictly necessary, as the 'oddifying' step: u >>= __builtin_clz(u); will effectively perform this operation in the next iteration.

Supposing that (u) or (v) have a 'random' bit distribution, the probability of (n) trailing zeroes, via tzcnt, is ~ (1/(2^n)). This instruction is an improvement over bsf, the implementation for __builtin_clz prior to Haswell, IIRC.

Marble answered 26/8, 2020 at 23:40 Comment(9)
Like the ARM output of this algorithm where it can use rev/clz to handle the ctz.Prat
quick-bench: clang, gcc - Looking really nice.Discourteous
The 64bit version beats std::gcd by an even bigger factor. Though, on Intel Ice Lake the results could be different, it improves the speed of division.Stites
tzcnt has a false dependencies - actually listed as errata - often inserting xor r, r to break them. I'm not sure if a more recent microarchitecture has fixed this. bsf is much improved, however, so a a lot of compilers are falling back on it.Marble
@harold - well then we can just use: return ((v == 0) ? u : u64_gcd(v, u % v)); ... but I have read that division has been given a serious boost in Icelake.Marble
Is it that this gcd is limited to unsigned numbers that makes it so much faster than std::gcd?Discourteous
@TedLyngmo std::gcd in the GNU library just computes std::gcd(std::abs(u), std::abs(v)). This might result in some small amount of overhead. std::gcd uses Euclid's algorithm, which makes much more of a difference.Dent
@J.Schultke Nice - there's room for improvement in their std::gcd then. :-)Discourteous
The inner loop can be made branch-free, although it probably doesn't matter if you have conditional moves: u = ((u + v) >> (u >= v)) - v; v = ((u + v) >> !(u >= v)) - u;Hulking
D
3

Thanks to helpful commentators, I have found the crucial mistake: I should have used min instead of max

This is the final solution:

#include <algorithm>

constexpr unsigned gcd(unsigned u, unsigned v)
{
    if (u == v || u == 0 || v == 0)
        return u | v;

    // effectively compute min(ctz(u), ctz(v))
    unsigned shift = __builtin_ctz(u | v);
    u >>= __builtin_ctz(u);
    v >>= __builtin_ctz(v);

    const auto &[min, max] = std::minmax(u, v);

    return gcd(max - min, min) << shift;
}

int main() {
    constexpr unsigned g = gcd(25, 15); // g = 5
    return g;
}

This solution also has very nice, nearly branch-free compile output.

Here are some benchmark results of all the answers so far (we actually beat std::gcd): benchmark

Dent answered 26/8, 2020 at 20:26 Comment(11)
By the way you could take the minimum trailing zero count with __builtin_ctz(u | v)Stites
Using max instead of min was a mistake. But it is fine to use the individual shifts. Because the extra factors of 2 you removed from one of them don't change the greatest COMMON divisor.Centrosphere
If I take 2 odd numbers and multiply one by a power of 2, I have not changed the gcd.Centrosphere
It's 'branch-free' only if you don't include the cost of recursive call overheads. As a tail-recursive algorithm, it's relatively easy to transform it into an iterative implementation.Marble
Like @BrettHale's answer better. Compiler is better at improving this and recursion is rarely the fastest answer.Prat
@J.Schultke - you don't need to do it, but since |u - v| is guaranteed to be even, an unconditional left shift by (1) is very fast. The tzcnt and older bsf instructions are not so quick, and the probability of (n) trailing zeroes is effectively (1/(2^n)).Marble
BTW - it's probably better to print the result. For various and contentious reasons, a return code may only yield the low (8) bits; i.e., the godbolt result might effectively return: (g & 0xff)Marble
Unrelated: You can get references to the min and max values like this: const auto&[min, max] = std::minmax(u, v);Discourteous
Maybe my comment above wasn't unrelated after all (or maybe it was just a fluke) but using the above made it slightly faster when I tested (it went from 1.2 times faster than std::gcd to 1.3 times faster): quick-bench.com/q/Ee-lGBlYPwFQnfwSbCy2vLD9WvsDiscourteous
@TedLyngmo that's because (surprisingly) std::minmax really does have different compile output. I would have expected clang to optimize this down to the same code, but it didn't.Dent
@J.Schultke Yeah, I thought it would result in the same thing too :-)Discourteous

© 2022 - 2024 — McMap. All rights reserved.