Why two modular operations cannot be optimized as well as one modular operation

I'm reading a C++ function that handles timestamps, and there's an expression like this:

t % (60 * 60) % 60

I thought this expression was strictly equivalent to:

t % 60

However, when I entered this code into https://godbolt.org, I found that gcc did not think so.

Information:

Source code and compilation results: https://godbolt.org/z/fnsxjanja
gcc version: x86-64 gcc 13.2 (with -O3)
clang version: x86-64 clang 17.0.1 (with -O3)
msvc version: x64 msvc v19.38 (with -O2 because there is no -O3)
icx version: x86-64 icx 2024.0.0 (with -O3)

The following C++ code:

#include <cstdint>
uint64_t mod3600mod60(uint64_t n) { return n % 3600 % 60; }
uint64_t mod60(uint64_t n) { return n % 60; }

will be compiled into:

mod3600mod60(unsigned long):
        movabs  rax, 655884233731895169
        mov     rdx, rdi
        shr     rdx, 4
        mul     rdx
        movabs  rax, -8608480567731124087
        shr     rdx, 3
        imul    rdx, rdx, 3600
        sub     rdi, rdx
        mul     rdi
        mov     rax, rdx
        shr     rax, 5
        mov     rdx, rax
        sal     rdx, 4
        sub     rdx, rax
        mov     rax, rdi
        sal     rdx, 2
        sub     rax, rdx
        ret
mod60(unsigned long):
        movabs  rax, -8608480567731124087
        mul     rdi
        mov     rax, rdx
        shr     rax, 5
        mov     rdx, rax
        sal     rdx, 4
        sub     rdx, rax
        mov     rax, rdi
        sal     rdx, 2
        sub     rax, rdx
        ret

I know that for dividing by a constant or modulo a constant, the compiler will optimize, but for two consecutive modulo, the compiler does not seem to optimize to the same as modulo once.

First, I tried to compile the above two functions with gcc, clang, msvc and icx respectively, and found that they would not optimize these two functions to the same result.

Next, I tried changing both moduli to powers of 2 and found that for the following two functions:

#include <cstdint>
uint64_t mod2048mod64(uint64_t n) { return n % 2048 % 64; }
uint64_t mod64(uint64_t n) { return n % 64; }

All four compilers will optimize the two to be exactly the same. Here is an example of the compilation result of gcc:

mod2048mod64(unsigned long):
        mov     rax, rdi
        and     eax, 63
        ret
mod64(unsigned long):
        mov     rax, rdi
        and     eax, 63
        ret

Next, I tried changing the first modulus to a non-power of 2, for the following function:

#include <cstdint>
uint64_t mod3200mod64(uint64_t n) { return n % 3200 % 64; }

All four compilers will still optimize exactly the same as a single % 64.

Next, I tried to change both modulus to a smaller non-power of 2 to confirm whether the reason for the lack of optimization was that the modulus was too large. For the following two functions:

#include <cstdint>
uint64_t mod6mod3(uint64_t n) { return n % 6 % 3; }
uint64_t mod3(uint64_t n) { return n % 3; }

As with the 3600 and 60, none of the four compilers can optimize these two functions to the same level. Here is the optimization result of gcc as an example:

mod6mod3(unsigned long):
        movabs  rcx, -6148914691236517205
        mov     rax, rdi
        mul     rcx
        shr     rdx, 2
        lea     rax, [rdx+rdx*2]
        add     rax, rax
        sub     rdi, rax
        mov     rax, rdi
        mul     rcx
        mov     rax, rdx
        and     rdx, -2
        shr     rax
        add     rdx, rax
        mov     rax, rdi
        sub     rax, rdx
        ret
mod3(unsigned long):
        movabs  rax, -6148914691236517205
        mul     rdi
        mov     rax, rdx
        and     rdx, -2
        shr     rax
        add     rdx, rax
        mov     rax, rdi
        sub     rax, rdx
        ret

Finally, I changed the parameters and return values of all the above functions to uint32_t and uint16_t respectively for compilation, and the results obtained were similar to uint64_t.

Summarized as follows:

For two operations on unsigned integers modulo a constant n % A % B, where A is a multiple of B:
- For the case where B is a power of 2, gcc, clang, msvc and icx can all optimize it to be the same as a single n % B;
- For the case where B is not a power of 2, gcc, clang, msvc and icx cannot optimize it to be the same as a single n % B.

Shouldn't n % A % B and n % B be strictly mathematically equivalent if A is a multiple of B?

This question is based on a fundamental misunderstanding. The title starts out with a false mathematical assumption - the two operations are equivalent. But compilers are simply not obliged to implement equivalent operations using identical code.

Hence, all the experiments prove very little. And with the next update of each compiler, these results could change. So the total lack of version numbers means that this small bit of information is not very useful.

I suppose part of the reason for the question is a misunderstanding how code optimization works. Every specific optimization you can think of has to be implemented in the compiler code, proven to be robust for all possible legal inputs, and not interfere with other optimizations. Is it worth doing this? Consider the concrete example again - that is not an argument for optimizing compilers, that is an argument for better programmers. If you mean x%60, then write x%60.

Note that the very large constants (e.g. -6148914691236517205) are already the result of an optimization. "Modulo" is essentially the remainder after division, but division is a very expensive operation. Hence, existing optimizations replace this by a multiply, with carefully tuned overflow operation (i.e. modulo 2^64), or indeed a bitmask operation for powers of two. With the %3600 %60 case, this ends up with two overflowing multiplies. Yes, those can also be merged, but this is way harder to prove correct. Optimizing two modulo operations in a row would need to happen before the operation is replaced with a multiply.

Recommended topics

Hot tags