gcc and cpu_relax, smp_mb, etc.?
I've been reading on compiler optimizations vs CPU optimizations, and volatile vs memory barriers.

One thing which isn't clear to me: my current understanding is that CPU optimizations and compiler optimizations are orthogonal, i.e. they can occur independently of each other.

However, the article volatile considered harmful makes the point that volatile should not be used. Linus's post makes similar claims. The main reasoning, IIUC, is that marking a variable as volatile disables all compiler optimizations when accessing that variable (i.e. even if they are not harmful), while still not providing protection against memory reorderings. Essentially, the main point is that it's not the data that should be handled with care, but rather a particular access pattern needs to be handled with care.

Now, the volatile considered harmful article gives the following example of a busy loop waiting for a flag:

while (my_variable != what_i_want) {}

and makes the point that the compiler can optimize the access to my_variable so that it only occurs once and not in a loop. The solution, so the article claims, is the following:

while (my_variable != what_i_want)
    cpu_relax();

It is said that cpu_relax acts as a compiler barrier (earlier versions of the article said that it's a memory barrier).

I have several gaps here:

1) Is the implication that gcc has special knowledge of the cpu_relax call, and that it translates to a hint to both the compiler and the CPU?

2) Is the same true for other primitives such as smp_mb() and the like?

3) How does that work, given that cpu_relax is essentially defined as a C macro? If I manually expand cpu_relax, will gcc still respect it as a compiler barrier? How can I know which calls are respected by gcc?

4) What is the scope of cpu_relax as far as gcc is concerned? In other words, what's the scope of reads that cannot be optimized by gcc when it sees the cpu_relax instruction? From the CPU's perspective, the scope is wide (memory barriers place a mark in the read or write buffer). I would guess gcc uses a smaller scope - perhaps the C scope?

Inattentive asked 26/12, 2017 at 7:17 Comment(0)
  1. Yes, gcc has special knowledge of the semantics of cpu_relax or whatever it expands to, and must translate it to something for which the hardware will respect the semantics too.

  2. Yes, any kind of memory fencing primitive needs special respect by the compiler and hardware.

  3. Look at what the macro expands to, e.g. compile with "gcc -E" and examine the output. You'll have to read the compiler documentation to find out the semantics of the primitives.

  4. The scope of a memory fence is as wide as the scope the compiler might move a load or store across. A non-optimizing compiler that never moves loads or stores across a subroutine call might not need to pay much attention to a memory fence that is represented as a subroutine call. An optimizing compiler that does interprocedural optimization across translation units would need to track a memory fence across a much bigger scope.

Filippo answered 27/12, 2017 at 4:49 Comment(2)
Thanks for the detailed answer! A clarification re point 4 (scope): in the example I gave above (cpu_relax in a busy loop), the document I linked to claims that this also causes gcc to treat the memory as volatile, i.e. not cache it in a register. How can I know what scope this applies to? I would assume that this doesn't disable register-caching across the entire function or compilation unit, but how can I know? (Inattentive)
"Treat it as volatile" is somewhat misleading. All the compiler is required to do is store the value into memory before the fence, and reload it after the fence. It's free to keep the value in a register the rest of the time. Furthermore, if the compiler can prove the value is never transferred between threads, then it is free to keep it in a register all the time. For example, a link-time optimizer might detect when a threading library is never used, and eliminate all fencing behavior. (Filippo)

There are a number of subtle questions related to CPU and SMP concurrency in your questions which will require you to look at the kernel code. Here are some quick ideas to get you started on the research, specifically for the x86 architecture.

The idea is that you are trying to perform a concurrency operation where your kernel task (see struct task_struct in the kernel source's sched.h) sits in a tight loop comparing my_variable with a local value until it is changed by another kernel task (or changed asynchronously by a hardware device!). This is a common pattern in the kernel.

  1. The kernel has been ported to a number of architectures and each has a specific set of machine instructions to handle concurrency. For x86, cpu_relax maps to the PAUSE machine instruction. It allows an x86 CPU to run a spinlock more efficiently so that the lock variable update is more readily visible to the spinning CPU. GCC compiles the call just like any other function or macro. If cpu_relax is removed from the loop then gcc CAN treat the loop as having no effect and remove it. Look at the Intel x86 Software Developer's Manuals for the PAUSE instruction.

  2. smp_mb is an x86 memory fence instruction that flushes the memory cache. One CPU can change my_variable in its cache but it will not be visible to other CPUs. smp_mb provides on-demand cache coherency. Look at the Intel X86 Software Manuals for MFENCE/LFENCE instructions.

Note that smp_mb() flushes the CPU cache so it CAN be an expensive operation. Current Intel CPUs have huge caches (~6MB).

  3. If you expand cpu_relax on x86, it shows asm volatile("rep; nop" ::: "memory"). The "rep; nop" encodes the PAUSE instruction, and the "memory" clobber makes the statement a compiler barrier in its own right: GCC must assume memory may have changed across it. The kernel's barrier() macro, asm volatile("" ::: "memory"), gives GCC the same hint without emitting any instruction.

  4. I'm not clear what you mean by the "scope" of cpu_relax. Some possible ideas: it is the PAUSE machine instruction, similar to ADD or MOV; PAUSE affects only the current CPU; and PAUSE allows for more efficient cache coherency between CPUs.

I just looked at the PAUSE instruction a little more: an additional property is that it prevents the CPU from doing out-of-order memory speculation when leaving a tight loop/spinlock. I'm not clear what THAT means, but I suppose it could briefly indicate a false value in a variable? Still a lot of questions...

Substantialism answered 5/1, 2018 at 13:4 Comment(7)
Thanks for the reply! Some followups: 2) my understanding was that smp_mb typically doesn't incur a cache flush - see e.g. this question. Intel's docs on MFENCE/LFENCE also do not mention a cache flush. 4) By "scope" I referred to the fact that cpu_relax acts as a compiler barrier, preventing gcc from moving stuff around. I was wondering what the scope of this prevention is: is it the block scope? Function scope? Or something else? (Inattentive)
@YSK: You're correct. MFENCE does not do a cache flush. It is a pure hardware barrier that appears to guarantee all load/store operations are completed (globally visible) before the fence. But that implies (to me) a local cache write-through for store operations. (Substantialism)
As for cpu_relax, I think I understand what you're asking and I don't have a definitive answer. I HOPE gcc -O4 would not aggressively change the behavior you coded, but I don't know enough about the optimization algorithms. I know in your specific example above, the compiler will remove the empty while loop but will not when an operation is called, and cpu_relax is fast (but barrier will work also). Additionally, look at the x86 cmpxchg instruction for spinlocks - it is atomic where your comparison loop probably is not. Good luck! (Substantialism)
As I understand it, the barrier instructions typically relate to write buffers and speculative reads (e.g. the recent Meltdown attack). I believe that caches are always coherent, so that a "flush" to main memory isn't needed. However, I've seen some references to delays in cache coherence updates which I don't really understand... (Inattentive)
Good points. I'll say that I wrote a high-speed shared memory driver and needed to use MFENCE and LFENCE to handle changes in the shared memory lock variable. Once in a while one task would change it and then another wouldn't detect the change and would also set it - so both tasks were writing to the shared memory area, causing collisions. The fences fixed the problem. I also set the shared memory to be marked "_uc" uncached and that seemed to work - but was noticeably slower. (Substantialism)
I recently came across, and started working my way through, the Linux kernel documentation on memory barriers. It covers pretty much everything we've discussed, and then some. It's a long read but it's good. (Inattentive)
Yeah, it's a good starting point. If you get deeper into the coding, you'll need the x86 software developer's manuals. Look at Vol 3 Ch 4 on paging and Ch 8 on process management. (Substantialism)

© 2022 - 2024 — McMap. All rights reserved.