pthreads v. SSE weak memory ordering


Asked 15/6, 2014 at 20:45 Answered 8/8, 2014 at 16:30

Solved multithreading pthreads atomic sse memory-fences

Do the Linux glibc pthread functions on x86_64 act as fences for weakly-ordered memory accesses? (pthread_mutex_lock/unlock are the exact functions I'm interested in).

SSE2 provides some instructions with weak memory ordering (non-temporal stores such as movntps in particular). If you are using these instructions and want to guarantee that another thread/core sees the stores in a particular order, then as I understand it you need an explicit fence, e.g., an sfence instruction.
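To make the pattern concrete, here is a minimal sketch of publishing data written with a non-temporal store. The names `nt_buffer`, `nt_ready`, `publish_nt`, and `consume_sum` are hypothetical and exist only for this illustration; the intrinsics `_mm_stream_ps` (movntps) and `_mm_sfence` come from `<xmmintrin.h>`. It assumes x86_64 and a C11 compiler.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_stream_ps (movntps), _mm_sfence, _mm_set_ps */

/* Hypothetical shared state for this sketch. */
_Alignas(16) float nt_buffer[4];
static atomic_int nt_ready;

/* Producer: fill dst with a non-temporal store, then publish a flag. */
void publish_nt(float *dst, float base)
{
    /* movntps requires a 16-byte-aligned destination. */
    assert(((uintptr_t)dst & 15u) == 0);

    /* Writes base, base+1, base+2, base+3 to dst[0..3]. */
    _mm_stream_ps(dst, _mm_set_ps(base + 3, base + 2, base + 1, base));

    /* Order the weakly-ordered store before the flag store below. */
    _mm_sfence();

    atomic_store_explicit(&nt_ready, 1, memory_order_release);
}

/* Consumer: wait for the flag, then read the data. */
float consume_sum(const float *src)
{
    while (!atomic_load_explicit(&nt_ready, memory_order_acquire))
        ;  /* spin until the producer has published */
    return src[0] + src[1] + src[2] + src[3];
}
```

Without the `_mm_sfence()`, the release store to the flag is an ordinary store on x86, so nothing would order the earlier non-temporal store ahead of it.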

Normally you do expect the pthread API to act as a fence appropriately. However, I suspect normal C code on x86 will not generate weakly-ordered memory accesses, so I'm not confident that pthreads needs to act as a fence for weakly-ordered accesses.

Reading through the glibc pthread source code, a mutex is ultimately implemented using "lock cmpxchgl", at least on the uncontended path. So I'm guessing that what I need to know is: does that instruction act as a fence for SSE2 weakly-ordered accesses?

Worden answered 15/6, 2014 at 20:45 Comment(1)

Well, no one answered, so I rolled up my sleeves and wrote a test program. Progress so far: (1) with pthread spinlocks, I can easily demonstrate that they are not fences; (2) with pthread mutexes, I have failed to produce a case where they do not act as fences. I am not sure whether that failure is because the mutexes are fences or because I just got lucky. My test code is at gist.github.com/rcls/c855e3e782253e58e046 – Worden 15/7, 2014 at 6:36

Non-temporal stores need an sfence instruction to be ordered properly.

However, an efficient user-level implementation of a simple mutex typically releases the lock with an ordinary store, which does not imply a flush of the write buffers. This is in contrast to atomic read-modify-write operations such as lock cmpxchg, which imply a full memory fence.

So you can end up in a situation where the unlock does not provide store-release semantics for the non-temporal stores. These SSE stores can then be reordered after the unlock, and so become visible only after another thread has acquired the mutex.
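As a rough sketch of that hazard and its fix, the following uses a hypothetical toy spinlock (`toy_spinlock`, `toy_lock`, `toy_unlock` are invented for this example and are not glibc's implementation). Its unlock is a release store, which on x86-64 is typically compiled to an ordinary store, so the writer issues `_mm_sfence()` itself before releasing the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_stream_ps (movntps), _mm_sfence, _mm_set_ps */

/* Hypothetical toy lock: acquire is an atomic RMW, release is a plain store. */
typedef struct { atomic_flag held; } toy_spinlock;

static toy_spinlock toy_lock_obj = { ATOMIC_FLAG_INIT };
_Alignas(16) float toy_shared[4];

static void toy_lock(toy_spinlock *l)
{
    /* Atomic test-and-set: a locked read-modify-write on x86. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* spin */
}

static void toy_unlock(toy_spinlock *l)
{
    /* Release store: usually an ordinary mov on x86, not a fence for NT stores. */
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

void toy_writer(float base)
{
    assert(((uintptr_t)toy_shared & 15u) == 0);  /* movntps needs alignment */

    toy_lock(&toy_lock_obj);
    _mm_stream_ps(toy_shared, _mm_set_ps(base + 3, base + 2, base + 1, base));
    /* Explicit fence so the NT store cannot become visible after the unlock. */
    _mm_sfence();
    toy_unlock(&toy_lock_obj);
}

float toy_reader(void)
{
    toy_lock(&toy_lock_obj);
    float sum = toy_shared[0] + toy_shared[1] + toy_shared[2] + toy_shared[3];
    toy_unlock(&toy_lock_obj);
    return sum;
}
```

The same placement applies when the lock comes from a library: put the sfence after the last non-temporal store and before the unlock call, rather than relying on how the unlock happens to be implemented.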

Milurd answered 8/8, 2014 at 16:30 Comment(2)

Thanks, I had thought only about the lock operation and missed that the ordering semantics of the unlock after the weakly-ordered store is what is critical. – Worden 9/8, 2014 at 23:34

And re-reading the glibc source code, this is consistent with the results of my testing. pthread_mutex_unlock, which appears to act as a fence, uses a locked instruction (via lowlevellock). pthread_spin_unlock uses a normal store. [Although the other difference is that pthread_spin_unlock is faster - presumably any sufficiently slow operation would in practice act as a fence here, even though the architectural description does not guarantee it.] – Worden 15/8, 2014 at 8:31
