How can I speed up the binary GCD algorithm using __builtin_ctz?

Asked 26/8, 2020 at 19:57 Answered 26/8, 2020 at 23:40

Solved c++algorithm bit-manipulation built-in greatest-common-divisor

clang and GCC have a int __builtin_ctz(unsigned) function. This counts the trailing zeros in an integer. The Wikipedia article on this family of functions mentions that the binary GCD algorithm can be sped up using __builtin_ctz, but I don't understand how.

The sample implementation of the binary GCD looks like this:

unsigned int gcd(unsigned int u, unsigned int v)
{
    // simple cases (termination)
    if (u == v)
        return u;

    if (u == 0)
        return v;

    if (v == 0)
        return u;

    // look for factors of 2
    if (~u & 1) // u is even
        if (v & 1) // v is odd
            return gcd(u >> 1, v);
        else // both u and v are even
            return gcd(u >> 1, v >> 1) << 1;

    if (~v & 1) // u is odd, v is even
        return gcd(u, v >> 1);

    // reduce larger argument
    if (u > v)
        return gcd(u - v, v);

    return gcd(v - u, u);
}

My suspicion is that I could use __builtin_ctz as follows:

constexpr unsigned int gcd(unsigned int u, unsigned int v)
{
    // simplified first three ifs
    if (u == v || u == 0 || v == 0)
        return u | v;

    unsigned ushift = __builtin_ctz(u);
    u >>= ushift;

    unsigned vshift = __builtin_ctz(v);
    v >>= vshift;

    // Note sure if max is the right approach here.
    // In the if-else block you can see both arguments being rshifted
    // and the result being leftshifted only once.
    // I expected to recreate this behavior using max.
    unsigned maxshift = std::max(ushift, vshift);

    // The only case which was not handled in the if-else block before was
    // the odd/odd case.
    // We can detect this case using the maximum shift.
    if (maxshift != 0) {
        return gcd(u, v) << maxshift;
    }

    return (u > v) ? gcd(u - v, v) : gcd(v - u, u);
}

int main() {
    constexpr unsigned result = gcd(5, 3);
    return result;
}

Unfortunately this does not work yet. The program results in 4, when it should be 1. So what am I doing wrong? How can I use __builtin_ctz correctly here? See my code so far on GodBolt.

Dent answered 26/8, 2020 at 19:57 Comment(6)

The other sample of binary GCD there essentially has min where you have max, but it works a bit different overall – Stites 26/8, 2020 at 20:12

How does it compare to std::gcd? Is this supposed to be faster? – Discourteous 26/8, 2020 at 20:38

@TedLyngmo not sure. I tried to benchmark it but my benchmark segfaulted. Note that the link is to Godbolt because quick-bench doesn't let your share failed benchmarks. Do you know what's wrong with this benchmark? It's not really supposed to be faster anyways, the question is rather about understanding how to use CTZ. – Dent 26/8, 2020 at 20:56

It's a stackoverflow in your gcd. It goes into a very deep recursion @ u=3508125240, v=2952784951. Here are the max, min and shift values. – Discourteous 27/8, 2020 at 5:44

@TedLyngmo thanks for clearing that up. My implementation is faster than std::gcd and BrettHale's is even faster than that. (see my answer for the benchmark results) – Dent 27/8, 2020 at 13:16

@J.Schultke You are welcome. The implementation I tested was apparently broken so I couldn't get any numbers from it more in a couple of tests and then it was really slow, but as it was broken I didn't pay much attention to that. The tests I put in a comment under @BrettHale's answer was indeed much faster than std::gcd. – Discourteous 27/8, 2020 at 13:20

Here's my iterative implementation from the comments:

While tail-recursive algorithms are often elegant, iterative implementations are almost always faster in practice. (Modern compilers can actually perform this transform in very simple cases.)

unsigned ugcd (unsigned u, unsigned v)
{
    unsigned t = u | v;

    if (u == 0 || v == 0)
        return t; /* return (v) or (u), resp. */

    int g = __builtin_ctz(t);

    while (u != 0)
    {
        u >>= __builtin_ctz(u);
        v >>= __builtin_ctz(v);

        if (u >= v)
            u = (u - v) / 2;
        else
            v = (v - u) / 2;
    }

    return (v << g); /* scale by common factor. */
}

As mentioned, the |u - v| / 2 step is typically implemented as a very efficient, unconditional right shift, e.g., shr r32, to divide by (2) - as both (u), (v) are odd, and therefore |u - v| must be even.

It's not strictly necessary, as the 'oddifying' step: u >>= __builtin_clz(u); will effectively perform this operation in the next iteration.

Supposing that (u) or (v) have a 'random' bit distribution, the probability of (n) trailing zeroes, via tzcnt, is ~ (1/(2^n)). This instruction is an improvement over bsf, the implementation for __builtin_clz prior to Haswell, IIRC.

Marble answered 26/8, 2020 at 23:40 Comment(9)

Like the ARM output of this algorithm where it can use rev/clz to handle the ctz. – Prat 27/8, 2020 at 0:13

quick-bench: clang, gcc - Looking really nice. – Discourteous 27/8, 2020 at 6:25

The 64bit version beats std::gcd by an even bigger factor. Though, on Intel Ice Lake the results could be different, it improves the speed of division. – Stites 27/8, 2020 at 6:52

tzcnt has a false dependencies - actually listed as errata - often inserting xor r, r to break them. I'm not sure if a more recent microarchitecture has fixed this. bsf is much improved, however, so a a lot of compilers are falling back on it. – Marble 27/8, 2020 at 8:0

@harold - well then we can just use: return ((v == 0) ? u : u64_gcd(v, u % v)); ... but I have read that division has been given a serious boost in Icelake. – Marble 27/8, 2020 at 8:4

Is it that this gcd is limited to unsigned numbers that makes it so much faster than std::gcd? – Discourteous 27/8, 2020 at 13:25

@TedLyngmo std::gcd in the GNU library just computes std::gcd(std::abs(u), std::abs(v)). This might result in some small amount of overhead. std::gcd uses Euclid's algorithm, which makes much more of a difference. – Dent 27/8, 2020 at 13:36

@J.Schultke Nice - there's room for improvement in their std::gcd then. :-) – Discourteous 27/8, 2020 at 13:38

The inner loop can be made branch-free, although it probably doesn't matter if you have conditional moves: u = ((u + v) >> (u >= v)) - v; v = ((u + v) >> !(u >= v)) - u; – Hulking 27/8, 2020 at 20:39

Thanks to helpful commentators, I have found the crucial mistake: I should have used min instead of max

This is the final solution:

#include <algorithm>

constexpr unsigned gcd(unsigned u, unsigned v)
{
    if (u == v || u == 0 || v == 0)
        return u | v;

    // effectively compute min(ctz(u), ctz(v))
    unsigned shift = __builtin_ctz(u | v);
    u >>= __builtin_ctz(u);
    v >>= __builtin_ctz(v);

    const auto &[min, max] = std::minmax(u, v);

    return gcd(max - min, min) << shift;
}

int main() {
    constexpr unsigned g = gcd(25, 15); // g = 5
    return g;
}

This solution also has very nice, nearly branch-free compile output.

Here are some benchmark results of all the answers so far (we actually beat std::gcd):

Dent answered 26/8, 2020 at 20:26 Comment(11)

By the way you could take the minimum trailing zero count with __builtin_ctz(u | v) – Stites 26/8, 2020 at 20:28

Using max instead of min was a mistake. But it is fine to use the individual shifts. Because the extra factors of 2 you removed from one of them don't change the greatest COMMON divisor. – Centrosphere 26/8, 2020 at 20:41

If I take 2 odd numbers and multiply one by a power of 2, I have not changed the gcd. – Centrosphere 26/8, 2020 at 21:1

It's 'branch-free' only if you don't include the cost of recursive call overheads. As a tail-recursive algorithm, it's relatively easy to transform it into an iterative implementation. – Marble 26/8, 2020 at 21:7

Like @BrettHale's answer better. Compiler is better at improving this and recursion is rarely the fastest answer. – Prat 26/8, 2020 at 21:31

@J.Schultke - you don't need to do it, but since |u - v| is guaranteed to be even, an unconditional left shift by (1) is very fast. The tzcnt and older bsf instructions are not so quick, and the probability of (n) trailing zeroes is effectively (1/(2^n)). – Marble 26/8, 2020 at 21:56

BTW - it's probably better to print the result. For various and contentious reasons, a return code may only yield the low (8) bits; i.e., the godbolt result might effectively return: (g & 0xff) – Marble 26/8, 2020 at 22:15

Unrelated: You can get references to the min and max values like this: const auto&[min, max] = std::minmax(u, v); – Discourteous 27/8, 2020 at 6:14

Maybe my comment above wasn't unrelated after all (or maybe it was just a fluke) but using the above made it slightly faster when I tested (it went from 1.2 times faster than std::gcd to 1.3 times faster): quick-bench.com/q/Ee-lGBlYPwFQnfwSbCy2vLD9Wvs – Discourteous 27/8, 2020 at 13:42

@TedLyngmo that's because (surprisingly) std::minmax really does have different compile output. I would have expected clang to optimize this down to the same code, but it didn't. – Dent 27/8, 2020 at 14:3

@J.Schultke Yeah, I thought it would result in the same thing too :-) – Discourteous 27/8, 2020 at 14:5

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags