Built-in mod ('%') vs custom mod function: improve the performance of modulus operation

Asked 25/10, 2015 at 18:27 Answered 25/6, 2020 at 21:46

Solved c++performance optimization modulo modular-arithmetic

Recently I came to know that the mod('%') operator is very slow. So I made a function which will work just like a%b. But is it faster than the mod operator?

Here's my function

int mod(int a, int b)
{
    int tmp = a/b;
    return a - (b*tmp);
}

Resnatron answered 25/10, 2015 at 18:27 Comment(11)

Do your own benchmarking, but you're not going to do better than the compiler. – Obryan 25/10, 2015 at 18:29

hmm both method works equally fast... even sometimes manual mod is faster! – Resnatron 25/10, 2015 at 18:41

'I came to know that the mod('%') operator is very slow' - really? How do you measure that, and what are you comparing against? – Tetchy 25/10, 2015 at 18:56

First, generate assembly language output and verify that there is no function call generated. You will loose the contest if the compiler emits assembly instructions without any branch and return instructions. – Mongolism 25/10, 2015 at 19:8

By the way, some processors have instructions that can return both the quotient and the remainder of a division with one instruction. – Mongolism 25/10, 2015 at 19:9

^^^^ it's one 'div' call on many processors. – Tetchy 25/10, 2015 at 20:26

Example x86: 'F7 /6, DIV r/m32, Unsigned divide EDX:EAX by r/m32, with result stored in EAX = Quotient, EDX = Remainder. – Tetchy 25/10, 2015 at 20:30

Or, for ints, :'F7 /7, IDIV r/m32 Signed divide EDX:EAX by r/m32, with result stored in EAX = Quotient, EDX = Remainder. – Tetchy 25/10, 2015 at 20:32

I compared against other mathematical operator such as +/-. This operator can not do what mod can but work very fast. @ Martin James – Resnatron 26/10, 2015 at 8:29

DCV - sorry another user ratted this out as being homework/competition entry and you are conning SO contributors. – Tetchy 26/10, 2015 at 9:10

Just because the processor can do it in one instruction (e.g. DIV) does not mean it executes in one clock cycle. division and modulo take more than clock cycle. Typically division takes 1 clock cycle per bit, x86 can do it a bit faster because of black magic, but it is still an order of magnitude slower than addition and subtraction. – Polynesia 4/12, 2022 at 20:25

According to Chandler Carruth's benchmarks at CppCon 2015, the fastest modulo operator (on x86, when compiled with Clang) is:

int fast_mod(const int input, const int ceil) {
    // apply the modulo operator only when needed
    // (i.e. when the input is greater than the ceiling)
    return input >= ceil ? input % ceil : input;
    // NB: the assumption here is that the numbers are positive
}

I suggest that you watch the whole talk, he goes into more details on why this method is faster than just using % unconditionally.

Primitive answered 25/10, 2015 at 18:52 Comment(5)

AFAIK Chandler Carruth works on compiler construction. I understand his talk, that he tries to explain, how compiler builders find and handle improved optimizations. Resisting the temptation to make your hand crufted "optimized" implementations of fundamental stuff will help you to profit from that work as soon as the compiler has it - and of course all the optimizations you did not come up with as well. Note: I did not say "don't optimize", but "resist micro optimizations". – Piscary 25/10, 2015 at 19:19

I haven't watched the video either, but fast_mod seems broken for negative input: -5 % 5 == 0 but fast_mod(-5, 5) == -5. – Risinger 26/10, 2015 at 9:34

Yes. As mentioned in the last comment: "NB: the assumption here is that the numbers are positive". (by "numbers", I meant input and ceil) – Primitive 26/10, 2015 at 9:42

Oh, sorry, I didn't see this. Still it seems worth pointing out that the optimization only works for a very special case. C's % is defined not only for int and not only for positive numbers. – Risinger 26/10, 2015 at 10:2

whether this is good or not depends on the distribution of values for input. I find it mostly good when one needs to have a wrapping counter. – Defamatory 21/1, 2020 at 14:47

This will likely be compiler and platform dependent.

But I was interested and on my system you appear to be correct in my benchmarks. However the method from @865719's answer is fastest:

#include <chrono>
#include <iostream>

class Timer
{
    using clk = std::chrono::steady_clock;
    using microseconds = std::chrono::microseconds;

    clk::time_point tsb;
    clk::time_point tse;

public:

    void clear() { tsb = tse = clk::now(); }
    void start() { tsb = clk::now(); }
    void stop() { tse = clk::now(); }

    friend std::ostream& operator<<(std::ostream& o, const Timer& timer)
    {
        return o << timer.secs();
    }

    // return time difference in seconds
    double secs() const
    {
        if(tse <= tsb)
            return 0.0;
        auto d = std::chrono::duration_cast<microseconds>(tse - tsb);
        return d.count() / 1000000.0;
    }
};

int mod(int a, int b)
{
    int tmp=a/b;
    return a-(b*tmp);
}

int fast_mod(const int input, const int ceil) {
    // apply the modulo operator only when needed
    // (i.e. when the input is greater than the ceiling)
    return input < ceil ? input : input % ceil;
    // NB: the assumption here is that the numbers are positive
}

int main()
{
    auto N = 1000000000U;
    unsigned sum = 0;

    Timer timer;

    for(auto times = 0U; times < 3; ++times)
    {
        std::cout << "     run: " << (times + 1) << '\n';

        sum = 0;
        timer.start();
        for(decltype(N) n = 0; n < N; ++n)
            sum += n % (N - n);
        timer.stop();

        std::cout << "       %: " << sum << " " << timer << "s" << '\n';

        sum = 0;
        timer.start();
        for(decltype(N) n = 0; n < N; ++n)
            sum += mod(n, N - n);
        timer.stop();

        std::cout << "     mod: " << sum << " " << timer << "s" << '\n';

        sum = 0;
        timer.start();
        for(decltype(N) n = 0; n < N; ++n)
            sum += fast_mod(n, N - n);
        timer.stop();

        std::cout << "fast_mod: " << sum << " " << timer << "s" << '\n';
    }
}

Build: GCC 5.1.1 (x86_64)

g++ -std=c++14 -march=native -O3 -g0 ...

Output:

     run: 1
       %: 3081207628 5.49396s
     mod: 3081207628 4.30814s
fast_mod: 3081207628 2.51296s
     run: 2
       %: 3081207628 5.5522s
     mod: 3081207628 4.25427s
fast_mod: 3081207628 2.52364s
     run: 3
       %: 3081207628 5.4947s
     mod: 3081207628 4.29646s
fast_mod: 3081207628 2.56916s

Trichome answered 25/10, 2015 at 19:14 Comment(6)

What compiler flags did you use? – Jewett 25/10, 2015 at 19:44

@black I added build info to the answer. – Trichome 25/10, 2015 at 20:34

@Trichome My compiler can not compile your code. anyway can you use this new mod function and tell me the time please? int mod(int a, int b) { //int tmp=a/b; return a>b?a-(b*(a/b)):a; } – Resnatron 26/10, 2015 at 8:40

@Resnatron You need to tuen on C++11 to compile the code. In my tests your new function is about the same as fast_mod. – Trichome 26/10, 2015 at 11:9

This performance test might return very unprecise results. I would not trust them for making conclusions on how the calculations fare in real world examples. – Risinger 28/10, 2015 at 14:5

I tried both modulus and your function fast_mod function for factoring 1 trillion. The modulus operator averaged 1.9 seconds. The fast_mod function averaged 3.7 seconds. – Refresh 17/12, 2021 at 7:8

It is often possible for a programmer to beat the performance of the remainder operation in cases where a programmer knows things about the operands that the compiler doesn't. For example, if the base is likely to be a power of 2, but is not particularly likely to be larger than the value to be reduced, one could use something like:

unsigned mod(unsigned int x, unsigned int y)
{
  return y & (y-1) ? x % y : x & (y-1);
}

If the compiler expands the function in-line and the base is a constant power of 2, the compiler will replace the remainder operator with a bitwise AND, which is apt to be a major improvement. In cases where the base isn't a constant power of two, the generated code would need to do a little bit of computation before selecting whether to use the remainder operator, but in cases where the base happens to be a power of two the cost savings of the bitwise AND may exceed the cost of the conditional logic.

Another scenario where a custom modulus function may help is when the base is a fixed constant for which the compiler hasn't made provisions to compute the remainder. For example, if one wants to compute x % 65521 on a platform which can perform rapid integer shifts and multiplies, one may observe that computing x -= (x>>16)*65521; will cause x to be much smaller but will not affect the value of x % 65521. Doing the operation a second time will reduce x to the range 0..65745--small enough that a single conditional subtraction will yield the correct remainder.

Some compilers may know how to use such techniques to handle the % operator efficiently with a constant base, but for those that don't the approach can be a useful optimization, especially when dealing with numbers larger than a machine word [observe that 65521 is 65536-15, so on a 16-bit machine one could evaluate x as x = (x & 65535) + 15*(x >> 16). Not as readable as the form which subtracts 65521 * (x >> 16), but it's easy to see how it could be handled efficiently on a 16-bit machine.

Eppes answered 2/5, 2017 at 23:23 Comment(2)

Return type unsigned ? Never seen before by me and compilers. – Davedaveda 22/9, 2020 at 16:14

@AkibAzmain: Huh? It's equivalent to unsigned int, and I've seen it a lot more often than the longer form. – Eppes 22/9, 2020 at 17:28

Just contributing a little bit with this discussion. If you want to handle negative numbers, use the following function:

inline long long mod(const long long x, const long long y) {
    if (x >= y) {
        return x % y;
    } else if (x < 0) {
        return (x % y + y) % y;
    } else {
        return x;
    }
}

Horseshoe answered 25/6, 2020 at 21:46 Comment(1)

Beware of taking negative numbers (mod p). In many programming languages, ((−2)%5)!=(3%5). Thus you can compute the same hash values for two strings, but when you compare them, they appear to be different. To avoid this issue, you can use such construct in the code: x ← ((a%p)+p)%p instead of just x ← a%p – Horseshoe 25/6, 2020 at 22:33

Most of the time, your micro optimized code will not beat the compiler. I also don't know where that "wisdom" comes from, that claims the built in % to be slow. It is just as fast as the machine will be able to calculate it - with all the micro optimizations the compiler can do for you.

Also note, that performance measurements of such very small pieces of code is not an easy task. Conditionals of a loop construct or the jitter of your time measurement might dominate your results. You can find some talks on such issues by people like e.g. Andrei Alexantrescu, or Chandler Caruth on youtube. I have once written a micro benchmarking framework for a project I was working on. There is really a lot to care about, including external stuff like the OS preempting your thread, or moving it to another core.

Piscary answered 25/10, 2015 at 18:48 Comment(2)

Do you ever tried running it? I ran it on my pc several times and then asked this question. I dont know why mod(%) is slow but combination of subtraction(-) and division(/) works a very little bit faster. – Resnatron 26/10, 2015 at 8:32

@Resnatron well, then your compiler vendor should probably use your implementation instead of a naive one, or even better Chandlers. I'd expect the next version of clang to do so, because that is, what Chandler is working on. – Piscary 26/10, 2015 at 8:35

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags