Memory barrier and atomic_t on Linux

Recently, I have been reading some Linux kernel-space code, and I came across this:

uint64_t used;
uint64_t blocked;

used = atomic64_read(&g_variable->used);       //#1
barrier();                                     //#2
blocked = atomic64_read(&g_variable->blocked); //#3

What are the semantics of this code snippet? Does #2 make sure that #1 executes before #3? I am a little bit confused, because:

#A On 64-bit platforms, the atomic64_read macro expands to

used = (&g_variable->used)->counter           // where counter is volatile.

On 32-bit platforms, it is implemented with lock cmpxchg8b. I assume these two have the same semantics, and for the 64-bit version I think it means:

  1. all-or-nothing: we can exclude the cases where the address is unaligned or the word size is larger than the CPU's native word size.
  2. no optimization: the compiler is forced to actually read from the memory location (see the sketch below).
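
If I spell it out, my understanding of the 64-bit expansion amounts to something like this (a simplified sketch, not the kernel's exact code; my_atomic64_read is my own name for the reconstruction):

#include <stdint.h>

/* counter is read through a volatile-qualified lvalue, so gcc must
 * emit a real load instead of reusing a cached register value; an
 * aligned 8-byte load is a single MOV on x86_64, hence atomic. */
typedef struct { volatile int64_t counter; } atomic64_t;

static inline int64_t my_atomic64_read(const atomic64_t *v)
{
    return v->counter;    /* one aligned MOV; no ordering implied */
}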

atomic64_read does not have any semantics that preserve read ordering! See this.

#B The barrier macro is defined as

/* Optimization barrier */
/* The "volatile" is due to gcc bugs */
#define barrier() __asm__ __volatile__("": : :"memory")

From the wiki, this just prevents the gcc compiler from reordering reads and writes.
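
For example (my own illustration, not from the kernel, using the barrier() defined above), without the barrier gcc may keep flag cached in a register and never re-read it:

extern int flag;

void wait_for_flag(void)
{
    /* Without barrier(), gcc may hoist the load of flag out of the
     * loop and spin forever on a stale register copy.  barrier()
     * forces a fresh load each iteration, but it emits no CPU fence
     * instruction at all. */
    while (!flag)
        barrier();
}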

What I am confused about is: how does it disable reordering done by the CPU? In addition, can I treat the barrier macro as a full fence?

Madai answered 2/7, 2011 at 6:28 Comment(4)
Is it just me, or can this question be compressed to "How does this barrier() macro work?" ?Inesita
I think it's important to take atomix... into account; that is: are there any semantic differences when not using an atomic... access method? Does it depend upon the memory model (strong vs. weak)? Does one or the other imply cache flushing? Etc, etc.Flossie
@ptx, what is the meaning of atomix? Any reference?Madai
@Nicholas, perhaps, but that simple a question would probably be downvoted to oblivion for not showing research effort.Cantina

32-bit x86 processors don't provide simple atomic read operations for 64-bit types. The only atomic operation on 64-bit types on such CPUs that deals with "normal" registers is LOCK CMPXCHG8B, which is why it is used here. The alternative is to use MOVQ and MMX/XMM registers, but that requires knowledge of the FPU state/registers, and requires that all operations on that value are done with the MMX/XMM instructions.
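
As an illustration of the standard trick (a hedged sketch, not the kernel's actual implementation, which lives in assembler; atomic64_read_i386 is a name I made up): cmpxchg8b compares EDX:EAX with the 8-byte memory operand and, on mismatch, loads the current value into EDX:EAX. By making the "new" value ECX:EBX equal to the "expected" value EDX:EAX beforehand, memory is never modified, and EDX:EAX always ends up holding the current contents:

#include <stdint.h>

static inline uint64_t atomic64_read_i386(volatile uint64_t *p)
{
    uint64_t val;
    __asm__ __volatile__(
        "movl %%ebx, %%eax\n\t"   /* expected := new (value irrelevant) */
        "movl %%ecx, %%edx\n\t"
        "lock; cmpxchg8b %1"      /* EDX:EAX := *p, atomically          */
        : "=&A" (val)             /* "A" = the EDX:EAX register pair    */
        : "m" (*p)
        : "cc", "memory");
    return val;
}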

On 64-bit x86_64 processors, aligned reads of 64-bit types are atomic, and can be done with a MOV instruction, so only a plain read is required --- the use of volatile is just to ensure that the compiler actually does a read, and doesn't cache a previous value.

As for the read ordering, the inline assembler you quote ensures that the compiler emits the instructions in the right order, and this is all that is required on x86/x86_64 CPUs, provided the writes are correctly sequenced. LOCKed writes on x86 have a total ordering; plain MOV writes provide "causal consistency", so if thread A does x=1 then y=2, if thread B reads y==2 then a subsequent read of x will see x==1.
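
A sketch of that guarantee (hypothetical shared variables, both initially 0; plain MOV stores and loads, with barrier() only keeping the compiler honest):

#define barrier() __asm__ __volatile__("": : :"memory")

int x, y;   /* hypothetical shared variables, initially 0 */

void thread_a(void)
{
    x = 1;
    barrier();      /* compiler-only: stores stay in program order */
    y = 2;
}

void thread_b(void)
{
    int ry = y;
    barrier();      /* compiler-only: loads stay in program order  */
    int rx = x;
    /* On x86: if ry == 2, then rx == 1 is guaranteed. */
    (void)rx; (void)ry;
}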

On IA-64, PowerPC, SPARC, and other processors with a more relaxed memory model there may well be more to atomic64_read() and barrier().

Brookins answered 4/7, 2011 at 8:42 Comment(8)
Not true: the Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically: • Reading or writing a quadword aligned on a 64-bit boundaryYazbak
LOCK is necessary for read-modify-write, such as atomic increment/decrement or CAS, not for reading, not for writing. As GJ pointed out, those are atomic on aligned quadwords (per se, not in combination) already.Concatenate
@GJ: is that also true for non-Intel x86 CPUs? (I'm not saying that you're wrong, just that there might need to be consideration for non-Intel devices).Kaplan
@GJ: Yes, you're right. However there are no instructions to do the load in 32-bit mode except CMPXCHG8B and the MOVQ instruction. The latter is an MMX/SSE instruction, and requires that you know the FPU state and register availability.Brookins
@Anthony Williams: if the CPU supports XMM then you can use: movq xmm0, qword [Source]; movq qword [Destination], xmm0, which should have no influence on the FPU state; it is valid also for AMD CPUs.Yazbak
@GJ: That requires that xmm0 is free, and doesn't work on old CPUs without XMM. It also requires a three-stage load to get [source] into EDX:EAX for processing: [source]->xmm0, xmm0->[temp], [temp]->EAX, [temp+4]->EDX. IMO, using CMPXCHG8B is simpler, but if you're using XMM anyway that's a valid choice.Brookins
@Anthony Williams: yes, the CPU must support XMM, and we can also load directly from a 128-bit XMM register into 32-bit registers like: movq xmm0, qword [Source]; movd eax, xmm0; pshufd xmm0, xmm0, 1; movd edx, xmm0. That way, 128-bit atomic read/write is also possible in 32-bit x86 mode.Yazbak
@GJ: I'd forgotten that; I don't use XMM much. It's still an extra step, but might be worth doing. Thanks for the reminder!Brookins

x86 CPUs don’t do read-after-read reordering, so it is sufficient to prevent the compiler from doing any reordering. On other platforms such as PowerPC, things will look a lot different.
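
To make the two cases concrete (a sketch reusing the snippet from the question; smp_rmb() is the kernel's real read-barrier macro):

/* Enough on x86: the CPU never reorders a load with another load,
 * so only the compiler needs restraining. */
used = atomic64_read(&g_variable->used);
barrier();
blocked = atomic64_read(&g_variable->blocked);

/* Portable to weakly ordered CPUs: smp_rmb() expands to a real read
 * fence where needed (e.g. lwsync on PowerPC) and degrades to a
 * plain compiler barrier on x86. */
used = atomic64_read(&g_variable->used);
smp_rmb();
blocked = atomic64_read(&g_variable->blocked);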

Nahshun answered 4/7, 2011 at 6:25 Comment(0)
