Why does GCC fail to reduce a loop that increments two locations of the same buffer?

Asked 7/12, 2023 at 7:37 Answered 26/7, 2024 at 23:11

Here is a bounded loop that increments two locations of the same buffer.

unsigned int getid();
void foo(unsigned int *counter, unsigned int n) {
        unsigned int A = getid();
        unsigned int B = getid();
        for (unsigned int i = 0; i < n; i++) {
            ++counter[A];
            ++counter[B];
        }
}

According to the result from compiler explorer (https://godbolt.org/z/b1sjf5669), even at -O3 optimization level, the program still has a loop that involves add 1 command.

In contrast, if ++counter[B] is removed, the compiler can reduce the loop and optimize the program into counter[A] += n. (https://godbolt.org/z/4YoPcz5rT)

Why the compiler is conservative, i.e., not transforming the codes into counter[A] += n, counter[B] += n in this case?

I know that some alias issues would lead to the failure of loop reduction. However, there is only one buffer here and even A=B, optimizing the codes would not change the results.

Marleah answered 7/12, 2023 at 7:37 Comment(9)

Optimizations are a set of manually written rules for transformations, it might be that the rule for counter[A]+=n is not generic enough to catch this case. My uneducated guess would be that this rule requires no other access to counter[A] and maybe the compiler cannot "branch" and evaluate its decisions based on the relationship between A and B. – Enact 7/12, 2023 at 7:45

Yes, I think that optimization would valid; you could report it with the "missed-optimization" keyword on gcc.gnu.org/bugzilla . I don't think GCC is intentionally making sure all side-effects (even non-volatile ones) happen before a fault in case writing counter[B] faults or something, because that's not something GCC optimizations preserve in general. – Outbalance 7/12, 2023 at 9:6

clang has similar behavior. – Stethoscope 7/12, 2023 at 11:35

The compiler has no way to know if A and B are the same index. Granted, the compiler could check and branch, but that's not something they do these days. – Oxley 7/12, 2023 at 12:8

@fuz: It doesn't need to check if it does two separate memory-destination increments without overlapping their loads and stores. (That leaves it up to hardware to pipeline things if they're actually independent.) The OP's counter[A] += n, counter[B] += n; is correct in C terms even in the A==B case. (I'd have used ;, but , is a sequence point.) – Outbalance 7/12, 2023 at 12:11

GCC and Clang will optimize away the loop for two increments in some cases where it knows they're to different locations, like [A%4 + 0] and [A%4 + 16]. godbolt.org/z/PbeaW8nnE . (+16 is the next cache line, but it doesn't know how counter is aligned so even +1 could be into the next page. And it chooses to use one 64-bit SIMD increment if we use +1, so as we expect it's assuming no UB, not preserving fault behaviour.) But surprisingly if (A==B) return; doesn't help GCC or clang. Clang even unrolls but sticks to strictly alternating. – Outbalance 7/12, 2023 at 12:18

@PeterCordes That would be possible, too, but would require the compiler to understand that the increments commute. – Oxley 7/12, 2023 at 13:1

You don't need loops to see the problem, see gcc.gnu.org/bugzilla/show_bug.cgi?id=66261 for instance. Note that an increment actually appears as 3 operations internally: read, add, write, which makes it harder to express such commutativity (although of course it would still be possible). – Caprice 7/12, 2023 at 14:41

That’s why sometimes it makes sense to make local copies of anything that is going to undergo complex manipulation. – Lasser 26/7, 2024 at 22:26

Neither clang nor gcc is well equipped to handle situations where two lvalues would identify the same storage for some inputs, but may sometimes identify the same storage. Instead, they will usually ensure that any value which is written via one lvalue will be stored to memory before the other lvalue is used to read it from memory (yielding correct code that is often less efficient than would be possibile if separate code were used to handle the shared-storage and disjoint-storage cases) or occasionally make unsound assumptions about aliasing (which will yield code which is faster in cases where the assumptions end up being correct, but behave erroneously when they are not).

In your example, testing outside the loop whether A and B are equal and if they are not, passing counter+A and counter+B as restrict-qualified arguments to a static inline function that performs the repeated additions should yield the desired optimizations.

Note that clang and gcc limit the situations where the restrict qualifier may be used more tightly than the authors of the Standard either specified or presumably intended. The Standard's definition of "based upon" completely falls apart in situations involving equality comparisons between restrict-qualified pointers and other pointers that are not always based upon them, and the way clang and gcc treat such pointers does too. Both compilers are prone to assume that if pointer expressions p1 and p2 compare equal, the pointers may be used interchangeably, even if because of qualifiers they are not interchangeble. Thus, if one wants to use a separate loop for scenarios where pointers would be equal and those where the pointers would be unequal, it will be necesssary to perform the comparison on pointers that are not restrict qualified and then pass them to functions that take restrict-qualified pointers.

Themselves answered 26/7, 2024 at 23:11 Comment(0)

As was already said in the comments, increment is actually 3 operands: load, add, store.

Conservatively, compiler is not allowed to reorder MayAlias memory operations. To do what you want, the compiler must prove that the result of the computation does not depends on memory aliasing, which in general case would be non-trivial.

I think that the scope of this optimization is just too small for the implementation to be reasonable.

Fallal answered 4/3, 2024 at 18:28 Comment(0)

Recommended topics

Hot tags