Why two modular operations cannot be optimized as well as one modular operation
Asked Answered
R

1

6

I'm reading a C++ function that handles timestamps, and there's an expression like this:

t % (60 * 60) % 60

I thought this expression was strictly equivalent to:

t % 60

However, when I entered this code into https://godbolt.org, I found that gcc did not think so.

Information:

  • Source code and compilation results: https://godbolt.org/z/fnsxjanja
  • gcc version: x86-64 gcc 13.2 (with -O3)
  • clang version: x86-64 clang 17.0.1 (with -O3)
  • msvc version: x64 msvc v19.38 (with -O2 because there is no -O3)
  • icx version: x86-64 icx 2024.0.0 (with -O3)

The following C++ code:

#include <cstdint>
uint64_t mod3600mod60(uint64_t n) { return n % 3600 % 60; }
uint64_t mod60(uint64_t n) { return n % 60; } 

will be compiled into:

mod3600mod60(unsigned long):
        movabs  rax, 655884233731895169
        mov     rdx, rdi
        shr     rdx, 4
        mul     rdx
        movabs  rax, -8608480567731124087
        shr     rdx, 3
        imul    rdx, rdx, 3600
        sub     rdi, rdx
        mul     rdi
        mov     rax, rdx
        shr     rax, 5
        mov     rdx, rax
        sal     rdx, 4
        sub     rdx, rax
        mov     rax, rdi
        sal     rdx, 2
        sub     rax, rdx
        ret
mod60(unsigned long):
        movabs  rax, -8608480567731124087
        mul     rdi
        mov     rax, rdx
        shr     rax, 5
        mov     rdx, rax
        sal     rdx, 4
        sub     rdx, rax
        mov     rax, rdi
        sal     rdx, 2
        sub     rax, rdx
        ret

I know that for dividing by a constant or modulo a constant, the compiler will optimize, but for two consecutive modulo, the compiler does not seem to optimize to the same as modulo once.

First, I tried to compile the above two functions with gcc, clang, msvc and icx respectively, and found that they would not optimize these two functions to the same result.

Next, I tried changing both moduli to powers of 2 and found that for the following two functions:

#include <cstdint>
uint64_t mod2048mod64(uint64_t n) { return n % 2048 % 64; }
uint64_t mod64(uint64_t n) { return n % 64; } 

All four compilers will optimize the two to be exactly the same. Here is an example of the compilation result of gcc:

mod2048mod64(unsigned long):
        mov     rax, rdi
        and     eax, 63
        ret
mod64(unsigned long):
        mov     rax, rdi
        and     eax, 63
        ret

Next, I tried changing the first modulus to a non-power of 2, for the following function:

#include <cstdint>
uint64_t mod3200mod64(uint64_t n) { return n % 3200 % 64; }

All four compilers will still optimize exactly the same as a single % 64.

Next, I tried to change both modulus to a smaller non-power of 2 to confirm whether the reason for the lack of optimization was that the modulus was too large. For the following two functions:

#include <cstdint>
uint64_t mod6mod3(uint64_t n) { return n % 6 % 3; }
uint64_t mod3(uint64_t n) { return n % 3; } 

As with the 3600 and 60, none of the four compilers can optimize these two functions to the same level. Here is the optimization result of gcc as an example:

mod6mod3(unsigned long):
        movabs  rcx, -6148914691236517205
        mov     rax, rdi
        mul     rcx
        shr     rdx, 2
        lea     rax, [rdx+rdx*2]
        add     rax, rax
        sub     rdi, rax
        mov     rax, rdi
        mul     rcx
        mov     rax, rdx
        and     rdx, -2
        shr     rax
        add     rdx, rax
        mov     rax, rdi
        sub     rax, rdx
        ret
mod3(unsigned long):
        movabs  rax, -6148914691236517205
        mul     rdi
        mov     rax, rdx
        and     rdx, -2
        shr     rax
        add     rdx, rax
        mov     rax, rdi
        sub     rax, rdx
        ret

Finally, I changed the parameters and return values of all the above functions to uint32_t and uint16_t respectively for compilation, and the results obtained were similar to uint64_t.

Summarized as follows:

  • For two operations on unsigned integers modulo a constant n % A % B, where A is a multiple of B:
    • For the case where B is a power of 2, gcc, clang, msvc and icx can all optimize it to be the same as a single n % B;
    • For the case where B is not a power of 2, gcc, clang, msvc and icx cannot optimize it to be the same as a single n % B.

Shouldn't n % A % B and n % B be strictly mathematically equivalent if A is a multiple of B?

Ragsdale answered 4/3, 2024 at 15:14 Comment(4)
Are the computed values correct?Bod
You have done something wrong when using godbolt, since all compilers has nailed this problem: godbolt.org/z/c3Wbs68v5 next time share link to your experiment on compiler explorer.Besprent
@PeteBecker What I'm talking about is compiler optimization, not just whether the result is correct. If you just want the result to be correct, why not just -O0? I think an obviously established formula can be used to optimize.Ragsdale
@MarekR Thanks for the suggestion, godbolt sharing link has been added.Ragsdale
F
7

This question is based on a fundamental misunderstanding. The title starts out with a false mathematical assumption - the two operations are equivalent. But compilers are simply not obliged to implement equivalent operations using identical code.

Hence, all the experiments prove very little. And with the next update of each compiler, these results could change. So the total lack of version numbers means that this small bit of information is not very useful.

I suppose part of the reason for the question is a misunderstanding how code optimization works. Every specific optimization you can think of has to be implemented in the compiler code, proven to be robust for all possible legal inputs, and not interfere with other optimizations. Is it worth doing this? Consider the concrete example again - that is not an argument for optimizing compilers, that is an argument for better programmers. If you mean x%60, then write x%60.

Note that the very large constants (e.g. -6148914691236517205) are already the result of an optimization. "Modulo" is essentially the remainder after division, but division is a very expensive operation. Hence, existing optimizations replace this by a multiply, with carefully tuned overflow operation (i.e. modulo 2^64), or indeed a bitmask operation for powers of two. With the %3600 %60 case, this ends up with two overflowing multiplies. Yes, those can also be merged, but this is way harder to prove correct. Optimizing two modulo operations in a row would need to happen before the operation is replaced with a multiply.

Fulgurant answered 4/3, 2024 at 15:28 Comment(5)
The potential value of this optimization comes when other optimizations reduce to this logic after inlining, or when two semantically-different constants happen to be related and get used this way. Like int x = y % FOO % BAR; where FOO isn't required to be a multiple of BAR but does happen to be. This is probably a pretty rare situation, but an optimization pass to look for it might not add much compilation time, just checking abs(gcd(x,y)) == min(|x|,|y|) when two compile-time-constant remainders are chained (so looking for chained remainders is probably most of the cost in real code.)Mansized
It's reasonable to argue that it's not worth the maintenance cost to have such an optimization pass as part of GCC or Clang, even if it wasn't enabled by default so didn't cost compile time. Or maybe this does happen in enough real codebases that compiler devs would want to include it especially if someone had already taken the time to implement it, IDK.Mansized
std::gcd is constexpr so one can easily use it to implement this optimization manually in C++ if needed (by pre-computing the gcd in a constexpr variable and then dividing each value by the constexpr assuming the two RHS values are constexpr too -- which would be also required for the compiler to do such optimization anyway).Eurypterid
MSalters, Todays compilers are just not mod-ern enough. 😉Gourami
MSalters, Sorry for the lack of information when writing it the first time, the version number information has now been added. And after updating the compilation parameters of msvc, the results are similar to those of other compilers.Ragsdale

© 2022 - 2025 — McMap. All rights reserved.