I am investigating Intel CPU atomic features on my Haswell CPU (a 4/8 core 2.3-3.9ghz i7-4790M), and am finding it really hard to construct eg. reliable mutex_lock() and mutex_unlock() operations as suggested by for instance the GCC manual:
6.53 x86-Specific Memory Model Extensions for Transactional Memory
The x86 architecture supports additional memory ordering flags to mark lock critical sections for hardware lock elision. These must be specified in addition to an existing memory model to atomic intrinsics.
'__ATOMIC_HLE_ACQUIRE'
Start lock elision on a lock variable. Memory model must be
'__ATOMIC_ACQUIRE' or stronger.
'__ATOMIC_HLE_RELEASE'
End lock elision on a lock variable. Memory model must be
'__ATOMIC_RELEASE' or stronger.
When a lock acquire fails it is required for good performance to abort the transaction quickly. This can be done with a '_mm_pause'
#include <immintrin.h> // For _mm_pause
int lockvar;
/* Acquire lock with lock elision */
while (__atomic_exchange_n(&lockvar, 1,
__ATOMIC_ACQUIRE|__ATOMIC_HLE_ACQUIRE))
_mm_pause(); /* Abort failed transaction */
...
/* Free lock with lock elision */
__atomic_store_n(&lockvar, 0, __ATOMIC_RELEASE|__ATOMIC_HLE_RELEASE);
So, reading that and the Intel Software Developer's Manual Vol.3 section 8.1, "Locked Atomic Operations", particulary section 8.1.4, "Effects of a LOCK Operation on Internal Processor Caches", led me to implement my test mutex_lock() mutex_unlock() at first like:
...
static inline attribute((always_inline,const))
bool ia64_has_clflush(void)
{ register unsigned int
ebx=0;
asm volatile
( "MOV $7, %%eax\n\t"
"MOV $0, %%ecx\n\t"
"CPUID\n\t"
"MOV %%ebx, %0\n\t"
: "=r" (ebx) :
: "%eax", "%ecx", "%ebx"
);
return ((ebx & (1U<<23)) ? true : false);
}
#define _LD_SEQ_CST_ __ATOMIC_SEQ_CST
#define _ST_SEQ_CST_ __ATOMIC_SEQ_CST
#define _ACQ_SEQ_CST_ (__ATOMIC_SEQ_CST|__ATOMIC_HLE_ACQUIRE)
#define _REL_SEQ_CST_ (__ATOMIC_SEQ_CST|__ATOMIC_HLE_RELEASE)
static bool has_clflush=false;
static
void init_has_clflush(void)
{ has_clflush = ia64_has_clflush();
}
static
void init_has_clflush(void) __attribute__((constructor));
static inline __attribute__((always_inline))
void mutex_lock( register _Atomic int *ua )
{ // the SDM states that memory to be used as semaphores
// should not be in the WB cache memory, but nearest we
// can get to uncached memory is to explicitly un-cache it:
if(has_clflush)
asm volatile
( "CLFLUSHOPT (%0)"
:: "r" (ua)
);
// why isn't the cache flush enough?
else
asm volatile
( "LFENCE" :: );
register unsigned int x;
x = __atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST_);
_mm_pause();
if(has_clflush)
asm volatile
( "CLFLUSHOPT (%0)"
:: "r" (ua)
);
else
asm volatile
( "SFENCE" :: );
while((x = __atomic_load_n(ua,_LD_SEQ_CST_)) != 0)
switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
{case 0:
break;
case -1:
switch( errno )
{ case EINTR:
case EAGAIN:
continue;
default:
fprintf(stderr,"Unexpected futex error: %d : '%s'.", errno,
strerror(errno));
return;
}
}
}
static inline __attribute__((always_inline))
void mutex_unlock( register _Atomic int *ua )
{ if(has_clflush)
asm volatile
( "CLFLUSHOPT (%0)"
:: "r" (ua)
);
else
asm volatile( "LFENCE" :: );
register unsigned int x;
x = __atomic_add_fetch( ua, 1, _REL_SEQ_CST_);
_mm_pause();
if(has_clflush)
asm volatile
( "CLFLUSHOPT (%0)"
:: "r" (ua)
);
else
asm volatile ( "SFENCE" :: );
if(x == 0)
while( (1 < syscall( SYS_futex, ua, FUTEX_WAKE, 1,
nullptr,nullptr,0)) && (errno == EINTR));
}
Now, what is interesting is that the critical mutex_lock() subtraction and mutex_unlock() addition operations end up as the instructions:
mutex_lock:
# 61 "intel_lock1.c" 1
CLFLUSHOPT (%rbx)
# 0 "" 2
#NO_APP
.L7:
lock xacquire subl $1, lck(%rip)
rep nop
cmpb $0, has_clflush(%rip)
je .L8
#APP
# 72 "intel_lock1.c" 1
CLFLUSHOPT (%rbx)
# 0 "" 2
mutex_unlock:
#APP
# 98 "intel_lock1.c" 1
CLFLUSHOPT (%rbx)
# 0 "" 2
#NO_APP
.L24:
movl $1, %eax
lock xacquire xaddl %eax, lck(%rip)
rep nop
addl $1, %eax
cmpb $0, has_clflush(%rip)
je .L25
#APP
# 109 "intel_lock1.c" 1
CLFLUSHOPT (%rbx)
# 0 "" 2
#NO_APP
But this implementation seems to require the LFENCE / SFENCE to function reliably (CLFLUSHOPT is not enough) , otherwise both threads can end up deadlocked in futex() with the lock value being an identical -1 .
I cannot see from reading the intel documentation how it can happen that two threads entering the instruction sequence :
# %rbx == $lck
CLFLUSHOPT (%rbx)
lock xacquire subl $1, lck(%rip)
rep nop
can both end up with the result '-1' in *lck if *lck was 0 ; surely one thread MUST get -1 and the other -2 ?
But strace says not:
strace: Process 11978 attached with 2 threads
[pid 11979] futex(0x60209c, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 11978] futex(0x60209c, FUTEX_WAIT, 4294967295, NULL^C
this is the deadlock situation. Where did I go wrong ?
Please can any Intel CPU Locking & Caching experts out there explain how two atomic decrements or increments of the same uncached location *lck that both assert the #LOCK bus signal (exclusive bus access) and XACQUIRE can end up getting the same result in *lck?
I thought that was what the #LOCK prefix (and HLE) was meant to prevent ? I have tried NOT using HLE and just __ATOMIC_SEQ_CST for all accesses, (this just adds the LOCK prefix, not XACQUIRE) but it makes no difference - deadlock still results without the {L,S}FENCE-es.
I have read Ulrich Drepper's excellent paper [ Futexes are Tricky ] :http://www.akkadia.org/drepper/futex.pdf , but he presents a mutex implementation that only writes hard-coded constants to the lock memory . I can see why . It is very hard to get a mutex to work reliably with a waiter count or any kind of arithmetic done on the lock value. Has anyone found ways to do reliable locked arithmetic such that the result is suitable for lock / semaphore value on x86_64 Linux ? Most interested in discussing them ...
So after a few blind alleys investigating HLE & CLFLUSH, the ONLY working version of the lock / unlock I've been able to arrive at uses hard coded constants and __atomic_compare_exchange_n - the full source of the test program, which increments a counter (without locking) until + / an exit signal is received, is at:
Working Example: intel_lock3.c
[]:https://drive.google.com/open?id=1ElB0qmwcDMxy9NBYkSXVxljj5djITYxa
enum LockStatus
{ LOCKED_ONE_WAITER = -1
, LOCKED_NO_WAITERS = 0
, UNLOCKED=1
};
static inline __attribute__((always_inline))
bool mutex_lock( register _Atomic int *ua )
{ register int x;
int cx;
lock_superceded:
x = __atomic_load_n( ua, _LD_SEQ_CST_ );
cx = x;
x = (x == UNLOCKED)
? LOCKED_NO_WAITERS
: LOCKED_ONE_WAITER;
if (! __atomic_compare_exchange_n
( ua, &cx, x, false, _ACQ_SEQ_CST_, _ACQ_SEQ_CST_) )
goto lock_superceded;
if( x == LOCKED_ONE_WAITER )
{ do{
switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
{case 0:
break;
case -1:
switch( errno )
{ case EINTR:
return false;
case EAGAIN:
break;
default:
fprintf(stderr,"Unexpected futex WAIT error: %d : '%s'.",
errno, strerror(errno));
return false;
}
}
x = __atomic_load_n(ua,_LD_SEQ_CST_);
} while(x < 0);
}
return true;
}
static inline __attribute__((always_inline))
bool mutex_unlock( register _Atomic int *ua )
{ register int x;
int cx;
unlock_superceded:
x = __atomic_load_n( ua, _LD_SEQ_CST_ );
cx = x;
x = (x == LOCKED_ONE_WAITER)
? LOCKED_NO_WAITERS
: UNLOCKED;
if (! __atomic_compare_exchange_n
( ua, &cx, x, false, _ACQ_SEQ_CST_, _ACQ_SEQ_CST_) )
goto unlock_superceded;
if(x == LOCKED_NO_WAITERS)
{ while((1 <
syscall( SYS_futex, ua, FUTEX_WAKE, 1, nullptr,nullptr,0))
||( UNLOCKED != __atomic_load_n( ua, _LD_SEQ_CST_ ))
) // we were a waiter, so wait for locker to unlock !
{ if( errno != 0 )
switch(errno)
{case EINTR:
return false;
case EAGAIN:
break;
default:
fprintf(stderr,
"Unexpected futex WAKE error: %d : '%s'.",
errno, strerror(errno));
return false;
}
}
}
return true;
}
Build & Test (GCC 7.3.1 & 6.4.1 & 5.4.0) used:
$ gcc -std=gnu11 -march=x86-64 -mtune=native -D_REENTRANT \
-pthread -Wall -Wextra -O3 -o intel_lock3 intel_lock3.c
$ ./intel_lock3
# wait a couple of seconds and press ^C
^C59362558
Broken Version Using Arithmetic:
https://drive.google.com/open?id=10yLrohdKLZT4p3G1icFHdjF5eHY68Yws
Compile with eg:
$ gcc -std=gnu11 -march=x86_64 -mtune=native -O3 -Wall -Wextra
-o intel_lock2 intel_lock2.c
$ ./intel_lock2
# wait a couple of seconds and press ^C
$ ./intel_lock2
^Cwas locked!
446
It should not be printing "was locked!" and within a couple of seconds should have exceeded a count, printed at the end, of @ 5e8 : 5x10^8 , not 446.
Running with strace shows that two threads are blocking waiting for the lock value of -1 to become 0 :
$ strace -f -e trace=futex ./intel_lock2
strace: Process 14481 attached
[pid 14480] futex(0x602098, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 14481] futex(0x602098, FUTEX_WAKE, 1 <unfinished ...>
[pid 14480] <... futex resumed> ) = -1 EAGAIN (Resource temporarily
unavailable)
[pid 14481] <... futex resumed> ) = 0
[pid 14480] futex(0x602098, FUTEX_WAKE, 1 <unfinished ...>
[pid 14481] futex(0x602098, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 14480] <... futex resumed> ) = 0
[pid 14481] <... futex resumed> ) = -1 EAGAIN (Resource temporarily
unavailable)
[pid 14480] futex(0x602098, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 14481] futex(0x602098, FUTEX_WAIT, 4294967295, NULL^C <unfinished
...>
[pid 14480] <... futex resumed> ) = ? ERESTARTSYS (To be restarted
if SA_RESTART is set)
strace: Process 14480 detached
strace: Process 14481 detached
was locked!
7086
$
Normally, the WAIT should be scheduled before the WAKE, but somehow GCC is interpreting the memory ordering semantics to mean that the WAKE is always getting scheduled before any WAIT ; but even if that happens, the code should just get delayed, and should never end up with two threads getting a -1 lck value on entry to futex(...FUTEX_WAIT..).
The almost identical algorithm using arithmetic on the lock value ALWAYS deadlocks when both threads get (-1,-1) - note, a -2 value is never seen by any thread:
static inline __attribute__((always_inline))
bool mutex_lock( register _Atomic volatile int *ua )
{ register int x;
x = __atomic_add_fetch( ua, -1, _ACQ_SEQ_);
if( x < 0 )
{ do{
// here you can put:
// if( x == -2) { .. NEVER REACHED! }
switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
{case 0:
break;
case -1:
switch( errno )
{ case EINTR:
return false; // interrupted - user wants to exit?
case EAGAIN:
break;
default:
fprintf(stderr,"Unexpected futex WAIT error: %d : '%s'.",
errno, strerror(errno));
return false;
}
}
x = __atomic_load_n(ua,_LD_SEQ_);
} while(x < 0);
}
return true;
}
static inline __attribute__((always_inline))
bool mutex_unlock( register _Atomic volatile int *ua )
{ register int x;
x = __atomic_add_fetch( ua, 1, _REL_SEQ_);
if(x == 0) // there was ONE waiter
while( (1 <
syscall( SYS_futex, ua, FUTEX_WAKE, 1, nullptr,nullptr,0)
)
||(1 < __atomic_load_n(ua, _LD_SEQ_)
) // wait for first locker to unlock
)
{ if( errno != 0 )
switch(errno)
{case EINTR:
return false;
case EAGAIN:
break;
default:
fprintf(stderr,"Unexpected futex WAKE error: %d : '%s'.",
errno, strerror(errno));
return false;
}
}
return true;
}
So, I think if if the arithmetic operations were working as expected, ie. were serialized and atomic, then the above code would not deadlock; the arithmetic should be generating the same numbers as the LockStatus enum values used in the working example.
But something is going wrong with the arithmetic, which now produces the instructions :
mutex_lock:
movl $-1, %eax
lock xaddl %eax, (%rdx)
mutex_unlock:
movl $1, %eax
lock xaddl %eax, (%rdx)
The code itself inserts no fences, but each __atomic_store_n(ua,...) generates one .
AFAICS, there is no valid schedule of that code that results in both threads getting the same -1 value.
So my conclusion is that use of the intel LOCK prefix on arithmetic instructions is unsafe and introduces buggy behaviour in user-mode Linux x86_64 gcc compiled programs - only writes of constant values from text memory to data memory is atomic and sequentially ordered on Intel Haswell i7-4790M platforms with gcc & Linux, and arithmetic on such platforms cannot be made to be atomic & sequentially ordered by use of any combination of HLE / XACQUIRE, lock prefix, or FENCE instructions.
My hunch is that branch prediction is somehow failing and adding an extra arithmetic operation / failing to perform an arithmetic operation on this platform with the LOCK prefix asserted and multiple threads on different physical cores . Therefore, all arithmetic operations with the LOCK prefix asserted are suspect and should be avoided.
asm("lfence")
without a"memory"
clobber to stop the compiler from reordering memory operations across it isn't safe. Also,lfence
andsfence
have no effect on correctness if you aren't using NT stores or WC memory. If it happens to make your code work, it's just because of the extra delay. And BTW, thelock
prefix on an aligned address won't cause a bus lock, just a cache-lock of that line. IDK why you'd want to useclflushopt
on a lock, either. That will make it slow for no gain in correctness. The store buffer already makes operations visible ASAP. – Tangleia64_has_clflush(void)
is misnamed: IA64 is Itanium. 64-bit x86 is called x86-64. Or just call itx86_has_clflush
. Or better don't use it at all. – Tanglelock xacquire sub
, or justlock sub
? It's really unclear what you're asking about, because you're talking aboutlock
asserting the#LOCK
bus signal (which isn't even a thing for addresses in DRAM on CPUs with built-in memory controllers), but then you're also usingxacquire
. It's not surprising thatxacquire
completely changes the behaviour, because it's no longer actually doing an atomic operation, but more like attempt the transaction and then abort/commit (maybe like an LL/SC machine). RTM is easier to understand. – Tanglexacquire
. – Tanglelock
ed operations work efficiently on WB memory without write-back to DRAM at all, and HLE is also designed to work efficiently on user-space mutexes in WB memory. – Tanglefutex
system call? Are you asking about it, or are you asking aboutlock sub
? If I want a fallback to OS-assisted sleep/wake, I callpthread_mutex_lock
or use a C++ mutex. If I want to roll my own lockless code, I use C11 or C++11 atomics. I haven't had to deal with a case where I needed to fall back to an OS-assisted sleep/wake in a retry loop in a lockless algo. People have implemented such things, though, in code like liblfds: Lock-free Progress Guarantees – Tangleif()
blocks for instructions that should be irrelevant. A much simpler minimal reproducible example would be better, without any of the clflush or fence crap. – Tanglelock subl $1, (%rdi)
is 100% atomic even if the pointer is misaligned (but much slower in that case), and a full memory barrier. If your test is finding it isn't, your hardware is broken or your test is broken. – Tanglefutex
stuff - a pretty obscure call that few people are going to know. If I understand it correctly, you've already determined that the values are "wrong" before thefutex
call so why not just cut that part out? You are entitled to ask questions how you want, but response may be limited (GIGO). – Intervene