Is it better to avoid using the mod operator when possible?

I assume that calculating the modulus of a number is a somewhat expensive operation, at least compared to simple arithmetic tests (such as seeing if a number exceeds the length of an array). If this is indeed the case, is it more efficient to replace, for example, the following code:

res = array[(i + 1) % len];

with the following?

res = array[(i + 1 == len) ? 0 : i + 1];

The first one is easier on the eyes, but I wonder if the second might be more efficient. If so, might I expect an optimizing compiler to replace the first snippet with the second when a compiled language is used?

Of course, this "optimization" (if it is indeed an optimization) doesn't work in all cases (in this case, it only works if i+1 is never more than len).

Horizontal answered 24/3, 2013 at 7:54 Comment(5)
This might be a case of missing the forest for the trees. – Gilgamesh
If len is a compile-time constant, a recent GCC compiler (with -O2) usually does clever things, often avoiding the modulus machine instruction of the target processor. – Fog
This really is the kind of optimization you should forget about. The optimizing compiler will do better than you could. What matters much more is the readability of your code. – Fog
Or you could make the array 1 longer and copy the first element into the new last element so you can access it normally (see the sketch after these comments). Any of these three options may be the fastest, depending on circumstances. – Torrlow
This is usually used in circular queues. – Protactinium
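
A minimal sketch of the sentinel idea from the comment above (LEN and the sample values are illustrative, not from the thread):

#include <stdio.h>

#define LEN 4

int main(void)
{
    int a[LEN + 1] = {10, 20, 30, 40};
    a[LEN] = a[0];                  /* duplicate the first element at the end */

    for (int i = 0; i < LEN; i++)   /* a[i + 1] is now always in bounds */
        printf("next of a[%d] is %d\n", i, a[i + 1]);
}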

My general advice is as follows. Use whichever version you think is easier on the eye, and then profile your entire system. Only optimize those parts of the code that the profiler flags up as bottlenecks. I'll bet my bottom dollar that the modulo operator isn't going to be among them.

As far as the specific example goes, only benchmarking can tell which is faster on your specific architecture using your specific compiler. You are potentially replacing modulo with branching, and it's anything but obvious which would be faster.

Tarrant answered 24/3, 2013 at 7:57 Comment(4)
On recent machines integer arithmetic is nearly free; what matters much more is cache misses, which are much more costly. An L1 cache miss stalls the processor for hundreds of cycles, during which the processor could do dozens of divisions or modulus operations; so the eventual cost of the modulus is noise. – Fog
@BasileStarynkevitch: Well, cache behaviour is going to be identical between the two code snippets. What's going to matter is whether or not version #2 uses branching and, if it does, how good a job the branch predictor is going to do. – Tarrant
@BasileStarynkevitch I've seen a factor of about 300 between modulo vs. accessing a big table on a laptop. (Adding a test for divisibility by 17 squared to avoid the array access was still beneficial.) – Goblin
@Tarrant It might be worthwhile to add to this answer that the C language itself doesn't have speed; that's a quality of the implementation (e.g. "your specific architecture"). In addition to being dependent upon hardware, "the speed of the modulo operator" is dependent upon the quality of the compiler building code for the hardware; a poor one might use the assembly equivalent of int foo = 54321; int bar = foo / 10000; foo -= bar * 10000; to obtain the modulo, while a good-quality compiler might even optimise that code. – Foumart

Some simple measurements:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int test = atoi(argv[1]);
    int divisor = atoi(argv[2]);
    int iterations = atoi(argv[3]);

    int a = 0;

    if (test == 0) {
        for (int i = 0; i < iterations; i++)
            a = (a + 1) % divisor;
    } else if (test == 1) {
        for (int i = 0; i < iterations; i++)
            a = a + 1 == divisor ? 0 : a + 1;
    }

    printf("%d\n", a);
}

Compiling with either gcc or clang with -O3, and running time ./a.out 0 42 1000000000 (modulo version) or time ./a.out 1 42 1000000000 (comparison version) results in

  • 6.25 seconds user runtime for the modulo version,
  • 1.03 seconds for the comparison version.

(using gcc 5.2.1 or clang 3.6.2; Intel Core i5-4690K @ 3.50GHz; 64-bit Linux)

This means that it is probably a good idea to use the comparison version.

Infralapsarian answered 31/1, 2016 at 13:33 Comment(4)
On more realistic data (for example, if the number were random) the difference would not be that big. – Gerlach
The comparison version is only faster because the result of the if statement is the same every time, so the branch predictor gets it right every time. If you randomized the input, the comparison version might even be worse than mod. – Drucill
@Drucill Hmm, but the result of the if clause is the same for both tests all the time? – Banna
He is referencing the ternary (?:) operator; your code, depending on the size of the divisor, only guesses wrong 1 out of every 100, 400, etc. – Altamirano

Well, have a look at 2 ways to get the next value of a "modulo 3" cyclic counter.

int next1(int n) {
    return (n + 1) % 3;
}

int next2(int n) {
    return n == 2 ? 0 : n + 1;
}

I've compiled it with the gcc -O3 option (for the common x64 architecture), and -S to get the assembly code.

The code for the first function does some unexplainable magic (*) to avoid a division, using a multiplication anyway:

addl    $1, %edi
movl    $1431655766, %edx
movl    %edi, %eax
imull   %edx
movl    %edi, %eax
sarl    $31, %eax
subl    %eax, %edx
leal    (%rdx,%rdx,2), %eax
subl    %eax, %edi
movl    %edi, %eax
ret

And is much longer (and I bet slower) than the second function:

leal    1(%rdi), %eax
cmpl    $2, %edi
movl    $0, %edx
cmove   %edx, %eax
ret

So it is not always true that "the (modern) compiler does a better job than you anyway".

Interestingly, the same experiment with 4 instead of 3 leads to and-masking for the first function:

addl    $1, %edi
movl    %edi, %edx
sarl    $31, %edx
shrl    $30, %edx
leal    (%rdi,%rdx), %eax
andl    $3, %eax
subl    %edx, %eax
ret

but it is still, by and large, inferior to the second version.

Being more explicit about the proper way to do things

int next3(int n) {
    return (n + 1) & 3;
}

yields much better results:

leal    1(%rdi), %eax
andl    $3, %eax
ret

(*) Well, not that complicated: multiplication by the reciprocal. Compute the integer constant K = (2^N)/3, for some large enough value of N. Now, when you want the value of X/3, instead of a division by 3, compute X*K and shift it N positions to the right.
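
A minimal sketch of that trick for unsigned 32-bit values (my example, not the exact signed sequence gcc emitted above; here N = 33 and K = ceil(2^33 / 3) = 2863311531):

#include <stdint.h>
#include <assert.h>

/* x / 3 computed as a multiply and a shift, with K = ceil(2^33 / 3). */
static uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 2863311531u) >> 33);
}

int main(void)
{
    for (uint32_t x = 0; x < 100000000u; x++)
        assert(div3(x) == x / 3);
}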

Stila answered 24/9, 2018 at 15:5 Comment(6)
The comparison in the second version might make it inferior to the first version; if it doesn't regularly predict the correct branch, that's going to screw up the pipeline. Still, +1 for demonstrating that modern compilers do not always magically find the best possible machine code. – Uranyl
@Uranyl As far as I understand, conditional moves were added to the instruction set (Pentium Pro), so no branch prediction takes place at all: "The CMOVcc instructions are useful for optimizing small IF constructions. They also help eliminate branching overhead for IF statements and the possibility of branch mispredictions by the processor." Pentium Pro Family Developer's Manual, vol. 2, p. 6-14. phatcode.net/res/231/files/24269101.pdf – Stila
Michel Billaud: Looks like you're right. I saw the cmpl and completely overlooked the lack of a jump. – Uranyl
The % 4 code is more complex because you are doing signed arithmetic. According to C99, the sign of the modulus must match the sign of the dividend, so it is not just a straight bitwise AND. Change the type to unsigned int, and you will get the same result as the & 3 code. – Robynroc
-1 because the answer strongly suggests judging by code size, which is an okay heuristic but a mistake when it comes to optimizations like the one in this question. Not all instructions are equal. Even on a RISC architecture some operations might take longer than others, and on a pipelined CPU the time spent executing a mispredicted branch (which is longer than the branch itself, and continues after the branch until the result of the branching condition is found deeper in the pipeline) might be longer than the time spent by the unconditional code with more instructions. – Emmettemmey
next1 and next2 are not comparable, because the first supports input from the whole integer range and the other works correctly only when the input is in the range 0..2. So if the compiler optimized next1 to be like next2, it would change the behaviour. – Elbring

If 'len' in your code is big enough, then the conditional will be faster, as the branch predictors will nearly always guess correctly.

If not, then I believe this is closely connected to circular queues, where it is often the case that the length is a power of 2. This will enable the compiler to substitute modulo with a simple AND.

The code is the following:

#include <stdio.h>
#include <stdlib.h>

#define modulo  /* comment this line out to benchmark the conditional version instead */

int main()
{
    int iterations = 1000000000;
    int size = 16;
    int a[size];
    unsigned long long res = 0;
    int i, j;

    for (i=0;i<size;i++)
        a[i] = i;

    for (i=0,j=0;i<iterations;i++)
    {
        j++;
        #ifdef modulo
            j %= size;
        #else
            if (j >= size)
                j = 0;
        #endif
        res += a[j];
    }

    printf("%llu\n", res);
}

size=15:

  • modulo: 4.868 s
  • cond: 1.291 s

size=16:

  • modulo: 1.067 s
  • cond: 1.599 s

Compiled with gcc 7.3.0, with -O3 optimization. The machine is an i7 920.
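
For the power-of-two case, the masking the compiler derives can also be written explicitly; this is my illustration, not part of the benchmark above:

#include <assert.h>
#include <stddef.h>

/* Advance a circular-buffer index; size must be a power of 2. */
static size_t advance(size_t j, size_t size)
{
    return (j + 1) & (size - 1);    /* same as (j + 1) % size here */
}

int main(void)
{
    for (size_t j = 0; j < 16; j++)
        assert(advance(j, 16) == (j + 1) % 16);
}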

Armandoarmature answered 17/4, 2019 at 14:12 Comment(3)
I wonder why the cond version's time is not the same in both cases. – Cruzcruzado
I think that, because res is not volatile, gcc can do many optimizations that become less effective as the size increases. When I add 'volatile' to res, the times for the conditional are always around 2 s. For modulo they are around 2 s when the size is a power of 2, and unstable (above 4 s, increasing with the size) otherwise. – Armandoarmature
I also noticed that in the case of non-volatile res, for size 1024 the conditional runs faster, in 1 s, so I guess it's about 'good' and 'bad' sizes for the optimizations (or branch predictors?). – Armandoarmature

Here is an additional benchmark. Note that I also added a branchless version:

#include <iostream>
#include <array>
#include <algorithm>
#include <random>
#include <chrono>
using namespace std::chrono;

constexpr size_t iter = 1e8;

int main() {
  std::minstd_rand rnd_engine{1234};
  std::uniform_int_distribution<int> dist {-1000, 1000};
  auto gen = [&]() { return dist(rnd_engine); };

  std::array<int, 10> a;
  std::generate( a.begin(), a.end(), gen);

  for (size_t size = 2; size < 10; size++) {
    std::cout << "Modulus size = " << size << '\n';
  
    {
      std::cout << "operator%  ";
      long sum = 0;
      size_t x = 0;
      auto start = high_resolution_clock::now();
      for (size_t i = 0; i < iter; ++i) {
        sum += a[x];
        x = (x + 1) % size;
      }
      auto stop = high_resolution_clock::now();
      std::cout << duration_cast<microseconds>(stop - start).count()*0.001
                << "ms\t(sum = " << sum << ")\n";
    }
  
    {
      std::cout << "ternary    ";
      long sum = 0;
      size_t x = 0;
      auto start = high_resolution_clock::now();
      for (size_t i = 0; i < iter; ++i) {
        sum += a[x];
        x = ((x + 1) == size) ? 0 : x + 1;
      }
      auto stop = high_resolution_clock::now();
      std::cout << duration_cast<microseconds>(stop - start).count()*0.001
                << "ms\t(sum = " << sum << ")\n";
    }
    
    {
      std::cout << "branchless ";
      long sum = 0;
      size_t x = 1;
      auto start = high_resolution_clock::now();
      for (size_t i = 0; i < iter; ++i) {
        sum += a[x-1];
        x = ( x != size ) * x + 1;
      }
      auto stop = high_resolution_clock::now();
      std::cout << duration_cast<microseconds>(stop - start).count()*0.001
                << "ms\t(sum = " << sum << ")\n";
    }

  }
  return 0;
}

And here is the output on my i7-4870HQ:

$ g++ -Ofast test.cpp && ./a.out
Modulus size = 2
operator%  904.249ms    (sum = -4200000000)
ternary    137.04ms     (sum = -4200000000)
branchless 169.182ms    (sum = -4200000000)
Modulus size = 3
operator%  914.911ms    (sum = -31533333963)
ternary    113.384ms    (sum = -31533333963)
branchless 167.614ms    (sum = -31533333963)
Modulus size = 4
operator%  877.3ms      (sum = -36250000000)
ternary    97.265ms     (sum = -36250000000)
branchless 167.215ms    (sum = -36250000000)
Modulus size = 5
operator%  891.295ms    (sum = -30700000000)
ternary    88.562ms     (sum = -30700000000)
branchless 167.087ms    (sum = -30700000000)
Modulus size = 6
operator%  903.644ms    (sum = -39683333196)
ternary    83.433ms     (sum = -39683333196)
branchless 167.778ms    (sum = -39683333196)
Modulus size = 7
operator%  908.096ms    (sum = -34585713678)
ternary    79.703ms     (sum = -34585713678)
branchless 166.849ms    (sum = -34585713678)
Modulus size = 8
operator%  869ms        (sum = -39212500000)
ternary    76.972ms     (sum = -39212500000)
branchless 167.29ms     (sum = -39212500000)
Modulus size = 9
operator%  875.003ms    (sum = -36500000580)
ternary    75.011ms     (sum = -36500000580)
branchless 172.356ms    (sum = -36500000580)

In this particular case the ternary operator looks far superior, and it becomes even more so when the branch predictor ramps up. Note however that this is a very particular case: if we were incrementing the index by a non-constant value, using the more general operator% would be straightforward, while the other two methods could become very intricate.
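
For illustration (a sketch of mine, not part of the benchmark): with a variable step, the modulo version stays a one-liner, while the conditional version needs a correction that only works because the step is bounded by the size:

#include <stddef.h>

/* Both helpers assume x < size and 0 <= step < size. */
static size_t wrap_mod(size_t x, size_t step, size_t size)
{
    return (x + step) % size;              /* general form */
}

static size_t wrap_cond(size_t x, size_t step, size_t size)
{
    x += step;                             /* at most one subtraction needed */
    return (x >= size) ? x - size : x;     /* valid only because step < size */
}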

I would like to stress this very much underrated comment:

if len is a compile-time constant, a recent GCC compiler (with -O2) usually does clever things, often avoiding the modulus machine instruction of the target processor. – Basile Starynkevitch

For instance, by removing the loop on the size variable and declaring it as const size_t size = 4; I get:

g++ -Ofast test.cpp && ./a.out
Modulus size = 4
operator%  62.103ms     (sum = -36250000000)
ternary    93.674ms     (sum = -36250000000)
branchless 166.774ms    (sum = -36250000000)

Conclusions

The execution time of the branchless version is pretty stable across the various scenarios. The ternary is consistently better than the branchless version over the considered cases, especially when the branch predictor kicks in. Finally, the operator%, while being more general and significantly slower, has a chance of being optimized into the fastest of the three, as happens for particular const values of the right-hand side.

Of course, this is completely platform dependent; who knows how this would go on an Arduino :)

Sevilla answered 20/8, 2020 at 19:36 Comment(0)

I read an article on making a fast hash map. A bottleneck can be the modulus operator used to find the hash bucket. The article suggested making the number of buckets a power of 2: taking a value modulo a power of two amounts to looking at its last n bits, as sketched below.
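
A minimal sketch of that idea (the names bucket_index and count are mine, not from the article):

#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* With a power-of-two bucket count, hash % count == hash & (count - 1):
   the mask keeps the last n bits of the hash. */
static size_t bucket_index(uint64_t hash, size_t count)
{
    return (size_t)(hash & (count - 1));   /* count must be a power of 2 */
}

int main(void)
{
    for (uint64_t h = 0; h < 100000; h++)
        assert(bucket_index(h, 64) == h % 64);
}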

Flita answered 5/9, 2020 at 23:8 Comment(0)

The modulo operator is expensive, but division is expensive too, so converting your code from the modulo operator to division is not going to optimize it.

  (i + 1) % len

To optimize the above code:

if ((i + 1) == len) {
   return 0;
} else {
   return i + 1;
}
Goodish answered 31/1, 2022 at 2:21 Comment(1)
I think you mistook the ternary operator for division. The author's own optimization is equivalent to the one you suggested. – Agogue

Yes, it will be noticeably slower if you're calling the modulus operator % in a loop. If you just call it a few times at most, I wouldn't worry about it.

Even in release mode with Visual Studio 2022, my profiler flagged the if (value % alignment != 0) statement at ~10% CPU usage, which was quite hefty. I was able to optimize it significantly by using a simple bitwise & instead, since my alignments were always powers of 2:

if ((value & (alignment - 1)) != 0)

The optimizing compiler cannot be sure that my alignment is always a power of 2, so it was not allowed to perform this optimization implicitly. This is one of those examples where compilers are not able to help you out automatically.

Now my profiler shows only 0.3% CPU usage for that if statement, which is way more efficient than before, when using % directly.
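
A minimal sketch checking that equivalence for power-of-two alignments (is_misaligned is an illustrative helper, not the original code):

#include <assert.h>
#include <stddef.h>

/* (value % alignment != 0) == ((value & (alignment - 1)) != 0)
   whenever alignment is a power of 2. */
static int is_misaligned(size_t value, size_t alignment)
{
    return (value & (alignment - 1)) != 0;
}

int main(void)
{
    for (size_t a = 1; a <= 4096; a <<= 1)      /* powers of two */
        for (size_t v = 0; v < 10000; v++)
            assert(is_misaligned(v, a) == (v % a != 0));
}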

Karlie answered 11/9, 2023 at 17:2 Comment(0)

Modulo can be done with a single processor instruction on most architectures (e.g. DIV on x86). However, it's likely a premature optimization for what you need.

Shortterm answered 24/3, 2013 at 8:2 Comment(7)
Just because there is a single instruction for an operation doesn't mean it occurs in a single clock cycle. – Felizio
@ChrisDesjardins Agreed, but % with a power-of-2 second operand can be represented as a bit mask. – Apostrophize
Sorry, had to downvote. I have worked with a lot of architectures (but not x86) and have yet to work with one that accomplishes mod/div in one instruction. And I have seen apps where mod is one of the top 10 CPU-consuming function calls because of all the circular buffering: each "sample" copy followed by a % buffersize. In my case I try to avoid mod if I can, usually by asserting that input buffer sizes are divisible by 2 so the compiler can optimize away the mod. – Rowenarowland
@Rowenarowland Good point: branch prediction works well for ring buffers on iterations. One may think branching is more expensive than modulo and so miss the opportunity to use it. – Jobber
div is the slowest integer ALU operation by far, like 35 to 90 cycles latency on Skylake for div r64, vs. 3 cycles latency for imul r64, r64. Related: "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?" shows just how slow div is, especially vs. a shift for a power of 2. – Lennie
@Rowenarowland The compiler cannot optimize away a mod if the operand is divisible by 2. Did you mean a power of 2? – Cantrell
@CaptainCodeman Yes, I meant power of 2. Good catch 7 years later :) – Rowenarowland
