Intel 64 and IA-32 atomic operations, acquire-release semantics, and GCC 5+

I am investigating Intel CPU atomic features on my Haswell CPU (a 4/8 core 2.3-3.9 GHz i7-4790M), and am finding it really hard to construct reliable mutex_lock() and mutex_unlock() operations of the kind suggested by, for instance, the GCC manual:

6.53 x86-Specific Memory Model Extensions for Transactional Memory

The x86 architecture supports additional memory ordering flags to mark lock critical sections for hardware lock elision. These must be specified in addition to an existing memory model to atomic intrinsics.

 '__ATOMIC_HLE_ACQUIRE'
 Start lock elision on a lock variable.  Memory model must be
 '__ATOMIC_ACQUIRE' or stronger.
 '__ATOMIC_HLE_RELEASE'
 End lock elision on a lock variable.  Memory model must be
 '__ATOMIC_RELEASE' or stronger.

When a lock acquire fails it is required for good performance to abort the transaction quickly. This can be done with a '_mm_pause'

 #include <immintrin.h> // For _mm_pause

 int lockvar;

 /* Acquire lock with lock elision */
 while (__atomic_exchange_n(&lockvar, 1, 
     __ATOMIC_ACQUIRE|__ATOMIC_HLE_ACQUIRE))
     _mm_pause(); /* Abort failed transaction */
 ...
 /* Free lock with lock elision */
 __atomic_store_n(&lockvar, 0, __ATOMIC_RELEASE|__ATOMIC_HLE_RELEASE);

So, reading that, and the Intel Software Developer's Manual Vol. 3 section 8.1, "Locked Atomic Operations", particularly section 8.1.4, "Effects of a LOCK Operation on Internal Processor Caches", led me to implement my test mutex_lock() / mutex_unlock() at first like:

...
static inline __attribute__((always_inline,const))
bool ia64_has_clflush(void)
{ register unsigned int ebx=0;
  asm volatile
  ( "MOV $7, %%eax\n\t"
    "MOV $0, %%ecx\n\t"
    "CPUID\n\t"
    "MOV %%ebx, %0\n\t"
    : "=r" (ebx)
    :
    : "%eax", "%ecx", "%ebx"
  );
  return ((ebx & (1U<<23)) ? true : false);
}
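For comparison (this sketch is mine, not part of the original test program), GCC's <cpuid.h> provides a __cpuid_count macro that performs the same leaf-7 query without hand-written asm; it assumes CPUID leaf 7 exists, which is true on Haswell:

#include <cpuid.h>     /* __cpuid_count macro (GCC / Clang) */
#include <stdbool.h>

/* CLFLUSHOPT feature flag: CPUID.(EAX=7,ECX=0):EBX bit 23 (per the SDM). */
static bool x86_has_clflushopt(void)
{ unsigned int eax, ebx, ecx, edx;
  __cpuid_count(7, 0, eax, ebx, ecx, edx);
  return (ebx >> 23) & 1u;
}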

#define _LD_SEQ_CST_ __ATOMIC_SEQ_CST
#define _ST_SEQ_CST_ __ATOMIC_SEQ_CST
#define _ACQ_SEQ_CST_ (__ATOMIC_SEQ_CST|__ATOMIC_HLE_ACQUIRE)
#define _REL_SEQ_CST_ (__ATOMIC_SEQ_CST|__ATOMIC_HLE_RELEASE)

static bool has_clflush=false;
static
void init_has_clflush(void)
{ has_clflush = ia64_has_clflush();
}
static
void init_has_clflush(void) __attribute__((constructor));

static inline __attribute__((always_inline))
void mutex_lock( register _Atomic int *ua )
{ // the SDM states that memory to be used as semaphores
  // should not be in the WB cache memory, but nearest we
  // can get to uncached memory is to explicitly un-cache it:
  if(has_clflush)
    asm volatile
    ( "CLFLUSHOPT (%0)"
      :: "r" (ua)
    );
  // why isn't the cache flush enough?
  else
    asm volatile
    ( "LFENCE" :: );
  register unsigned int x;
  x = __atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST_);
  _mm_pause();
  if(has_clflush)
    asm volatile
    ( "CLFLUSHOPT (%0)"
      :: "r" (ua)
    );
  else
    asm volatile
    ( "SFENCE" :: );
  while((x = __atomic_load_n(ua,_LD_SEQ_CST_)) != 0)
    switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
    {case 0:
      break;
     case -1:
      switch( errno )
      { case EINTR:
        case EAGAIN:
         continue;
        default:
         fprintf(stderr,"Unexpected futex error: %d : '%s'.", errno,
              strerror(errno));
        return;
      }
    }
}

  static inline __attribute__((always_inline))
  void mutex_unlock( register _Atomic int *ua )
  { if(has_clflush)
      asm volatile
      ( "CLFLUSHOPT (%0)"
      :: "r" (ua)
      );
    else
      asm volatile( "LFENCE" :: );
    register unsigned int x;
    x = __atomic_add_fetch( ua, 1, _REL_SEQ_CST_);
    _mm_pause();
    if(has_clflush)
      asm volatile
      ( "CLFLUSHOPT (%0)"
        :: "r" (ua)
      );
    else
      asm volatile ( "SFENCE" :: );
    if(x == 0)
      while( (1 < syscall( SYS_futex, ua, FUTEX_WAKE, 1,
           nullptr,nullptr,0)) && (errno == EINTR));
  }

Now, what is interesting is that the critical mutex_lock() subtraction and mutex_unlock() addition operations end up as the instructions:

mutex_lock:

# 61 "intel_lock1.c" 1
    CLFLUSHOPT (%rbx)
# 0 "" 2
#NO_APP
.L7:
    lock xacquire subl  $1, lck(%rip)
    rep nop
    cmpb    $0, has_clflush(%rip)
    je  .L8
#APP
# 72 "intel_lock1.c" 1
    CLFLUSHOPT (%rbx)
# 0 "" 2

mutex_unlock:

#APP
# 98 "intel_lock1.c" 1
    CLFLUSHOPT (%rbx)
# 0 "" 2
#NO_APP
.L24:
    movl    $1, %eax
    lock xacquire xaddl %eax, lck(%rip)
    rep nop
    addl    $1, %eax
    cmpb    $0, has_clflush(%rip)
    je  .L25
#APP
# 109 "intel_lock1.c" 1
    CLFLUSHOPT (%rbx)
# 0 "" 2
#NO_APP

But this implementation seems to require the LFENCE / SFENCE instructions to function reliably (CLFLUSHOPT alone is not enough); otherwise both threads can end up deadlocked in futex() with the lock value being an identical -1.

I cannot see from reading the Intel documentation how it can happen that two threads entering the instruction sequence:

# %rbx == $lck
CLFLUSHOPT (%rbx)
lock xacquire subl  $1, lck(%rip)
rep nop

can both end up with the result '-1' in *lck if *lck was 0; surely one thread MUST get -1 and the other -2?

But strace says not:

strace: Process 11978 attached with 2 threads
[pid 11979] futex(0x60209c, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 11978] futex(0x60209c, FUTEX_WAIT, 4294967295, NULL^C

This is the deadlock situation. Where did I go wrong?

Can any Intel CPU locking & caching experts out there explain how two atomic decrements or increments of the same uncached location *lck, both of which assert the #LOCK bus signal (exclusive bus access) and XACQUIRE, can end up getting the same result in *lck?

I thought that was what the LOCK prefix (and HLE) was meant to prevent? I have tried NOT using HLE and using just __ATOMIC_SEQ_CST for all accesses (this just adds the LOCK prefix, not XACQUIRE), but it makes no difference: deadlock still results without the {L,S}FENCEs.

I have read Ulrich Drepper's excellent paper "Futexes are Tricky" (http://www.akkadia.org/drepper/futex.pdf), but he presents a mutex implementation that only writes hard-coded constants to the lock memory. I can see why: it is very hard to get a mutex to work reliably with a waiter count or any kind of arithmetic done on the lock value. Has anyone found ways to do reliable locked arithmetic such that the result is suitable for a lock / semaphore value on x86_64 Linux? I am most interested in discussing them.

So after a few blind alleys investigating HLE & CLFLUSH, the ONLY working version of the lock / unlock I've been able to arrive at uses hard-coded constants and __atomic_compare_exchange_n - the full source of the test program, which increments a counter (without locking) until ^C / an exit signal is received, is at:

Working Example: intel_lock3.c
https://drive.google.com/open?id=1ElB0qmwcDMxy9NBYkSXVxljj5djITYxa

enum LockStatus
{ LOCKED_ONE_WAITER = -1
, LOCKED_NO_WAITERS = 0
, UNLOCKED=1
};

static inline __attribute__((always_inline))
bool mutex_lock( register _Atomic int *ua )
{ register int x;
  int cx;
 lock_superceded:
  x  = __atomic_load_n( ua, _LD_SEQ_CST_ );
  cx = x;
  x = (x == UNLOCKED)
       ? LOCKED_NO_WAITERS
       : LOCKED_ONE_WAITER;
  if (! __atomic_compare_exchange_n
      ( ua, &cx, x, false, _ACQ_SEQ_CST_,  _ACQ_SEQ_CST_) )
    goto lock_superceded;
  if( x == LOCKED_ONE_WAITER )
  { do{
    switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
    {case 0:
      break;
     case -1:
      switch( errno )
      { case EINTR:
         return false;
        case EAGAIN:
          break;
        default:
          fprintf(stderr,"Unexpected futex WAIT error: %d : '%s'.",
                  errno, strerror(errno));
          return false;
       }
    }
    x = __atomic_load_n(ua,_LD_SEQ_CST_);
    } while(x < 0);
  }
  return true;
}

static inline __attribute__((always_inline))
bool mutex_unlock( register _Atomic int *ua )
{ register int x;
  int cx;
 unlock_superceded:
  x  = __atomic_load_n( ua, _LD_SEQ_CST_ );
  cx = x;
  x = (x == LOCKED_ONE_WAITER)
       ? LOCKED_NO_WAITERS
       : UNLOCKED;
  if (! __atomic_compare_exchange_n
       ( ua, &cx, x, false, _ACQ_SEQ_CST_,  _ACQ_SEQ_CST_) )
    goto unlock_superceded;
    if(x == LOCKED_NO_WAITERS)
    { while((1 < 
             syscall( SYS_futex, ua, FUTEX_WAKE, 1, nullptr,nullptr,0))
         ||( UNLOCKED != __atomic_load_n( ua, _LD_SEQ_CST_ ))
         ) // we were a waiter, so wait for locker to unlock !
      { if( errno != 0 )
          switch(errno)
          {case EINTR:
            return false;
           case EAGAIN:
            break;
           default:
            fprintf(stderr,
                  "Unexpected futex WAKE error: %d : '%s'.", 
                  errno, strerror(errno));
            return false;
          }
      }
   }
   return true;
 }
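
For context, a hypothetical two-thread harness with the same shape as the test program (the real intel_lock3.c is only available at the link above, so names like worker() and the SIGINT handling here are my assumptions, not its actual contents):

#include <pthread.h>
#include <signal.h>
#include <stdio.h>

static _Atomic int lck = UNLOCKED;      /* lock word handed to futex()    */
static unsigned long counter;           /* deliberately NOT atomic        */
static volatile sig_atomic_t stop;

static void on_sigint(int sig) { (void)sig; stop = 1; }

static void * worker( void *arg )
{ (void)arg;
  while( !stop )
  { if( !mutex_lock( &lck ) )           /* false => interrupted by signal */
      break;
    counter += 1;                       /* the critical section           */
    mutex_unlock( &lck );
  }
  return NULL;
}

int main(void)
{ pthread_t a, b;
  signal( SIGINT, on_sigint );
  pthread_create( &a, NULL, worker, NULL );
  pthread_create( &b, NULL, worker, NULL );
  pthread_join( a, NULL );
  pthread_join( b, NULL );
  printf( "%lu\n", counter );           /* cf. the ^C59362558 output below */
  return 0;
}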

 Build & Test (GCC 7.3.1 & 6.4.1 & 5.4.0) used:
 $ gcc -std=gnu11 -march=x86-64 -mtune=native -D_REENTRANT \
   -pthread -Wall -Wextra -O3 -o intel_lock3 intel_lock3.c

 $ ./intel_lock3
 # wait a couple of seconds and press ^C
 ^C59362558

Broken Version Using Arithmetic:

https://drive.google.com/open?id=10yLrohdKLZT4p3G1icFHdjF5eHY68Yws

Compile with eg:

$ gcc -std=gnu11 -march=x86-64 -mtune=native -O3 -Wall -Wextra \
  -o intel_lock2 intel_lock2.c
$ ./intel_lock2
# wait a couple of seconds and press ^C
^Cwas locked!
446

It should not be printing "was locked!", and within a couple of seconds it should have exceeded a count, printed at the end, of around 5e8 (5x10^8), not 446.

Running with strace shows that two threads are blocked waiting for the lock value of -1 to become 0:

$ strace -f -e trace=futex ./intel_lock2
strace: Process 14481 attached
[pid 14480] futex(0x602098, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 14481] futex(0x602098, FUTEX_WAKE, 1 <unfinished ...>
[pid 14480] <... futex resumed> )       = -1 EAGAIN (Resource temporarily
                                          unavailable)
[pid 14481] <... futex resumed> )       = 0
[pid 14480] futex(0x602098, FUTEX_WAKE, 1 <unfinished ...>
[pid 14481] futex(0x602098, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 14480] <... futex resumed> )       = 0
[pid 14481] <... futex resumed> )       = -1 EAGAIN (Resource temporarily
                                          unavailable)
[pid 14480] futex(0x602098, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
[pid 14481] futex(0x602098, FUTEX_WAIT, 4294967295, NULL^C <unfinished  
...>
[pid 14480] <... futex resumed> )       = ? ERESTARTSYS (To be restarted 
if SA_RESTART is set)
strace: Process 14480 detached
strace: Process 14481 detached
was locked!
7086

$

Normally, the WAIT should be scheduled before the WAKE, but somehow GCC is interpreting the memory ordering semantics to mean that the WAKE is always getting scheduled before any WAIT; but even if that happens, the code should just get delayed, and should never end up with two threads getting a -1 lck value on entry to futex(...FUTEX_WAIT...).

The almost identical algorithm using arithmetic on the lock value ALWAYS deadlocks when both threads get (-1,-1) - note that a -2 value is never seen by any thread:

static inline __attribute__((always_inline))
bool mutex_lock( register _Atomic volatile int *ua )
{ register int x;
  x = __atomic_add_fetch( ua, -1, _ACQ_SEQ_);
  if( x < 0 )
  { do{
    // here you can put:
    // if( x == -2) { .. NEVER REACHED! }
    switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
    {case 0:
      break;
     case -1:
      switch( errno )
      { case EINTR:
         return false; // interrupted - user wants to exit?
        case EAGAIN:
          break;
        default:
          fprintf(stderr,"Unexpected futex WAIT error: %d : '%s'.",
                  errno, strerror(errno));
          return false;
       }
    }
    x = __atomic_load_n(ua,_LD_SEQ_);
    } while(x < 0);
  }
  return true;
}

static inline __attribute__((always_inline))
bool mutex_unlock( register _Atomic volatile int *ua )
{ register int x;
  x = __atomic_add_fetch( ua, 1, _REL_SEQ_);
  if(x == 0) // there was ONE waiter
    while( (1 <
            syscall( SYS_futex, ua, FUTEX_WAKE, 1, nullptr,nullptr,0)
           )
         ||(1 < __atomic_load_n(ua, _LD_SEQ_)
           ) // wait for first locker to unlock
         )
    { if( errno != 0 )
        switch(errno)
        {case EINTR:
          return false;
         case EAGAIN:
          break;
         default:
          fprintf(stderr,"Unexpected futex WAKE error: %d : '%s'.",
                  errno, strerror(errno));
          return false;
        }
    }
  return true;
}

So, I think that if the arithmetic operations were working as expected, i.e. were serialized and atomic, then the above code would not deadlock; the arithmetic should be generating the same numbers as the LockStatus enum values used in the working example.

But something is going wrong with the arithmetic, which now produces the instructions:

mutex_lock:

movl    $-1, %eax
lock xaddl  %eax, (%rdx)

mutex_unlock:

movl    $1, %eax
lock xaddl  %eax, (%rdx)

The code itself inserts no fences, but each __atomic_store_n(ua,...) generates one.

AFAICS, there is no valid schedule of that code that results in both threads getting the same -1 value.

So my conclusion is that use of the Intel LOCK prefix on arithmetic instructions is unsafe and introduces buggy behaviour in user-mode Linux x86_64 GCC-compiled programs: only writes of constant values from text memory to data memory are atomic and sequentially ordered on Intel Haswell i7-4790M platforms with GCC & Linux, and arithmetic on such platforms cannot be made atomic & sequentially ordered by any combination of HLE / XACQUIRE, the LOCK prefix, or FENCE instructions.

My hunch is that branch prediction is somehow failing and adding an extra arithmetic operation / failing to perform an arithmetic operation on this platform with the LOCK prefix asserted and multiple threads on different physical cores. Therefore, all arithmetic operations with the LOCK prefix asserted are suspect and should be avoided.

Wheelchair answered 2/6, 2018 at 14:14 Comment(23)
asm("lfence") without a "memory" clobber to stop the compiler from reordering memory operations across it isn't safe. Also, lfence and sfence have no effect on correctness if you aren't using NT stores or WC memory. If it happens to make your code work, it's just because of the extra delay. And BTW, the lock prefix on an aligned address won't cause a bus lock, just a cache-lock of that line. IDK why you'd want to use clflushopt on a lock, either. That will make it slow for no gain in correctness. The store buffer already makes operations visible ASAP.Tangle
ia64_has_clflush(void) is misnamed: IA64 is Itanium. 64-bit x86 is called x86-64. Or just call it x86_has_clflush. Or better don't use it at all.Tangle
Aha! Thank you Peter ! That was probably it. So one MUST use WC memory to have any hope of true atomicity of arithmetic operations with the lock prefix?Wheelchair
WTF? No, atomic ops work perfectly well on WB memory, and are most efficient that way. Cache is coherent, so atomically updating a line in L1d makes the update atomic to all other observers in the system. (i.e. all other cores). Can num++ be atomic for 'int num'?. See C & low-level semaphore implementation for a counting semaphore using C11 atomics. (Without using the futex system call, just a pure userspace implementation with no fallback to an OS sleep/wait, but shows how atomics work)Tangle
Thanks, I will review those posts. So there is no way mapping true WC memory into userspace linux process ?Wheelchair
I haven't done very much with HLE. Are you asking about lock xacquire sub, or just lock sub? It's really unclear what you're asking about, because you're talking about lock asserting the #LOCK bus signal (which isn't even a thing for addresses in DRAM on CPUs with built-in memory controllers), but then you're also using xacquire. It's not surprising that xacquire completely changes the behaviour, because it's no longer actually doing an atomic operation, but more like attempt the transaction and then abort/commit (maybe like an LL/SC machine). RTM is easier to understand.Tangle
I use 'ia64' to mean "Intel Core 64-bit architecture", not Itanium. It was just a test program. I will post fully working version back when I've fully explained why two locking 'subl's can still produce the same value.Wheelchair
I use 'ia64' to mean "Intel Core 64-bit architecture", not Itanium. Then you're wrong; don't do that :P IA64 already has a specific technical meaning in the context of Intel CPUs. Valid terms are Intel64, x86-64, and amd64. Or just x86, because your function would compile and work on 32-bit x86 as well.Tangle
I did try just the plain 'lock subl ...' (__ATOMIC_SEQ_CST | __ATOMIC_HLE_ACQUIRE) AND the 'lock xacquire subl' (__ATOMIC_SEQ_CST) versions, but both of them deadlock getting the same '-1' result, which I thought from reading the docs should be impossible - I am just trying to locate the source of my mis-reading - thanks! – Wheelchair
"when I've fully explained why two locking 'subl's can still produce the same value" - they can't, unless you disable actual locking with xacquire. – Tangle
I did try without xacquire but got the same results. – Wheelchair
Re: mapping WC memory: apparently it is possible in user-space under Linux: how to map memory as USWC under windows/linux?. But like I said, you definitely don't want this. Normal locked operations work efficiently on WB memory without write-back to DRAM at all, and HLE is also designed to work efficiently on user-space mutexes in WB memory. – Tangle
What exactly do you deadlock on? Is it the futex system call? Are you asking about it, or are you asking about lock sub? If I want a fallback to OS-assisted sleep/wake, I call pthread_mutex_lock or use a C++ mutex. If I want to roll my own lockless code, I use C11 or C++11 atomics. I haven't had to deal with a case where I needed to fall back to an OS-assisted sleep/wake in a retry loop in a lockless algo. People have implemented such things, though, in code like liblfds: Lock-free Progress Guarantees. – Tangle
It's really hard to read your code because it's incorrectly indented, and it's so cluttered with giant if() blocks for instructions that should be irrelevant. A much simpler minimal reproducible example would be better, without any of the clflush or fence crap. – Tangle
Anyway, lock subl $1, (%rdi) is 100% atomic even if the pointer is misaligned (but much slower in that case), and a full memory barrier. If your test is finding it isn't, your hardware is broken or your test is broken. – Tangle
Problem still occurs when intel_lock1.c (available at the URL above) is compiled on Linux with GCC 5 or 7 without either of the args '-D_WITH_CLFLUSH_' or '-D_WITH_HLE_' (so that neither CLFLUSH* nor HLE XACQUIRE are used) - the mutex_lock assembler now looks like: # 74 "intel_lock1.c" 1 LFENCE lock subl $1, lck(%rip) rep nop SFENCE. So, I'm now trying to replace {L,S}FENCE with MFENCE. I still don't quite understand how two threads can end up with the same -1 *lck value though. – Wheelchair
See the updated [ intel_lock1.c ]: drive.google.com/open?id=1je5lNcv7nzS802BJweM4NVUcYYIwfxQn . The updated version can be compiled with -D_NO_FENCE_ to not use any fences at all, or to use mfence with -D_MFENCE_ - still, the problem occurs: both threads get -1. The specific question I was trying to ask is: why do two threads get the same value when both execute 'lock subl , $addr , 1' and the contents of *addr is initially 0? – Wheelchair
Your copy/paste made a mess of URLs and formatting in your comments and "answer". – Tangle
I just fixed that, sorry: drive.google.com/open?id=1je5lNcv7nzS802BJweM4NVUcYYIwfxQn – Wheelchair
RE: "If I want to roll my own lockless code, I use C11 or C++11 atomics" - but that would work only for multiple threads, not also multiple processes. I'm trying, eventually, through refining these test cases, to get an improved lightweight locking class that works for threads and processes when the locks are in file-backed shared memory maps. – Wheelchair
I do have a working prototype, but that one still uses static constants instead of arithmetic - I've never been able to get a version that uses arithmetic on the mutex value to work, because of these deadlock issues that, to me, should not be happening. – Wheelchair
There is a lot to be said for reducing your problem before posting it to Stack Overflow, especially if you are trying to claim some type of compiler or hardware misbehavior or mis-specification. This is actually the type of question I'd be interested in, but it's just a huge mess of complex code, prose and speculation. If I understood it, you are getting at some unexpected behavior with respect to locked atomic operations, but you have also mixed in a big dose of Intel's RTM stuff. The set of people who will tackle this kind of x86-asm-and-C++-memory-ordering stuff is already ... – Intervene
... quite limited, and by throwing in the extraneous xacquire stuff, which had nothing to do with the problem you describe, it just cuts it down further (for example, I don't know the RTM stuff well so I wouldn't try to answer a question apparently about its intricacies). Then, on top, you add the futex stuff - a pretty obscure call that few people are going to know. If I understand it correctly, you've already determined that the values are "wrong" before the futex call, so why not just cut that part out? You are entitled to ask questions how you want, but responses may be limited (GIGO). – Intervene

lock subl $1, (%rdi) or lock xaddl %eax, (%rdx) are both 100% atomic in all cases, even if the pointer is misaligned (but much slower in that case), and are full memory barriers. On cacheable memory, there won't be any external #LOCK bus signal; the internal implementation just locks the cache line in M state of MESI inside a core that's running the locked instruction. See Can num++ be atomic for 'int num'? for more details.

If your test is finding it isn't atomic, your hardware is broken or your test is broken. Finding a deadlock tells you there's a bug in your design, not that your atomic primitive building-blocks aren't atomic. You can very easily test atomic increments by using two threads to increment a shared counter, and notice that no counts are lost. Unlike if you used addl $1, shared(%rip) without lock, where you would see lost counts.
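
For instance, a minimal sanity test of that claim (this is my sketch, not code from the question; the file name is arbitrary) is just two threads hammering one counter with a lock-prefixed RMW:

/* atomic_count.c - build: gcc -std=gnu11 -pthread -O2 atomic_count.c */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000
static atomic_int counter;                       /* zero-initialized */

static void *inc_worker(void *arg)
{ (void)arg;
  for (int i = 0; i < ITERS; i++)
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed); /* lock addl */
  return NULL;
}

int main(void)
{ pthread_t t1, t2;
  pthread_create(&t1, NULL, inc_worker, NULL);
  pthread_create(&t2, NULL, inc_worker, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("%d\n", atomic_load(&counter));  /* always 20000000: no lost counts */
  return 0;
}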

Also, lfence, sfence, and pause have no effect on correctness in the normal case (no NT stores, and using only WB (Write-Back) memory). If any of your fence / clflush stuff is helping, it's only by adding an extra delay somewhere that is maybe making that thread always lose a race in your test, not actually making it safe. mfence is the only fence that matters, blocking StoreLoad reordering and store-forwarding effects. (Which is why gcc uses it as part of implementing a seq-cst store).
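
To illustrate (exact code-gen varies by compiler and version, so treat these as typical examples rather than guaranteed output): on x86-64 a release store is just a plain mov, while a seq-cst store needs the full barrier:

_Atomic int lck2;   /* example variable, not from the question's code */

void store_release(void) { __atomic_store_n(&lck2, 0, __ATOMIC_RELEASE); }
/*  -> movl $0, lck2(%rip)                        plain store, no fence   */

void store_seq_cst(void) { __atomic_store_n(&lck2, 0, __ATOMIC_SEQ_CST); }
/*  -> movl $0, lck2(%rip) ; mfence               older GCC               */
/*  or  xorl %eax,%eax ; xchgl %eax, lck2(%rip)   newer GCC / Clang       */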

Get a basic version working right before you even think about messing around with HLE / transactional memory.


Race condition in your first version of acquiring the lock

x = __atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST_); is atomic, and only one thread's lock sub can change ua from 0 to -1 and get x=-1 from there.

But you aren't using the sub_fetch result, you're doing another load with
while((x = __atomic_load_n(ua,_LD_SEQ_CST_)) != 0)

So another thread can see ua=-1 if the first thread locks and then unlocks between the lock sub and the load in that 2nd thread.

The reason it's called sub_fetch is that it atomically returns the new (post-subtraction) value, as well as atomically modifying the value in memory. The fact that you discard the sub_fetch result is why it can compile to lock sub at all, instead of lock xadd with a register holding -1.

(Or a smart compiler could compile it to lock sub and check ZF, because you can tell when the value became non-zero or negative from flags set by lock sub.)
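
In other words, the decision "did I take the lock?" has to come from the value the locked RMW itself returned, not from a later separate load. A hedged sketch of that shape, using the first version's convention that 0 means free (the surrounding futex / retry logic is omitted):

int x = __atomic_sub_fetch( ua, 1, __ATOMIC_ACQUIRE );  /* new value, atomically */
if( x == -1 )
{ /* we performed the 0 -> -1 transition: we own the lock */
}
else
{ /* someone else holds it: go to the futex-wait / retry path using x;
   * a fresh __atomic_load_n() here would re-open the race described above */
}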


See C & low-level semaphore implementation for a simple semaphore with no fallback to OS-assisted sleep/wake. It spins on a load until we see a value greater than 0, then attempts to take the lock with C11 fetch_add(-1).

But if it loses the race to another thread, it undoes the decrement.

This is probably a poor design; it's probably best to attempt the decrement with a lock cmpxchg, so threads that fail won't have to undo their decrement.
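
A sketch of that cmpxchg approach for a counting semaphore (my names; non-blocking attempt only): the decrement is committed only if the value is still positive, so a losing thread never has to undo anything:

static bool sem_try_down( _Atomic int *sem )
{ int old = __atomic_load_n( sem, __ATOMIC_RELAXED );
  while( old > 0 )
  { /* lock cmpxchg: succeeds only if *sem still equals old;
     * on failure, old is updated to the current value and we re-test it. */
    if( __atomic_compare_exchange_n( sem, &old, old - 1, true,
                                     __ATOMIC_ACQUIRE, __ATOMIC_RELAXED ) )
      return true;                 /* took one unit; nothing to undo */
  }
  return false;                    /* semaphore was (or became) 0 */
}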


I haven't used HLE, but I assume this bug is what breaks your HLE locking as well.

You don't need SFENCE, LFENCE, or CLFLUSH[OPT] or anything. lock xadd is already a full memory barrier and 100% atomic on its own, on any memory type (including WB).

You probably misread the SDM if you thought it said you should avoid WB memory for mutexes / semaphores.


You also have a race window during wakeup that can lead to deadlock

This code in mutex_lock looks broken / race-prone

x = __atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST_);  // ok, fine
_mm_pause();   // you don't want a pause on the fast path.

if( x < 0 )   // just make this a while(x<0) loop
do {
   futex(..., FUTEX_WAIT, ...);

   x = __atomic_load_n(ua,_LD_SEQ_CST_);        // races with lock sub in other threads.
} while(x < 0);

Given thread A sleeping in futex with lck == -1 (if that's possible?):

  • thread B unlocks, resulting in lck == 0, and calls futex(FUTEX_WAKE)
  • thread A wakes up, futex returns while lck is still 0
  • some other thread (B or a 3rd thread) enters mutex_lock and runs __atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST_);, leaving lck == -1
  • thread A runs x = __atomic_load_n(ua,_LD_SEQ_CST_); at the bottom of its loop and sees -1

Now you have 2 threads stuck in the futex wait loop, and no thread actually got the mutex / entered the critical section.


I think your design is broken if it depends on doing a load after futex returns

The example in the futex(2) man page of fwait() shows it returning after futex returns, without loading again.

futex() is an atomic compare-and-block operation. Your design changes your counter value to -1 if one thread is waiting for the lock while a third thread tries to acquire it. So possibly your design is ok for 2 threads, but not for 3.

It's probably a good idea to use an atomic CAS for the decrement, so you never actually change lck to -1 or lower, and futex can stay blocked.

Then, if you can count on it to only ever wake 1, you can also trust its return value to mean you really have the lock, without the race-prone separate load. I think.
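
A hedged sketch of that shape (essentially the classic three-state futex mutex from Drepper's paper: 0 = unlocked, 1 = locked, 2 = locked with possible waiters; the wrapper names are mine). The lock word never goes below 0, so FUTEX_WAIT's compare value stays meaningful, and ownership is decided by the CAS / exchange result, not by a separate load:

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long sys_futex( atomic_int *ua, int op, int val )
{ return syscall( SYS_futex, ua, op, val, NULL, NULL, 0 ); }

static void lock( atomic_int *m )
{ int c = 0;
  if( atomic_compare_exchange_strong( m, &c, 1 ) )   /* 0 -> 1: uncontended */
    return;
  if( c != 2 )
    c = atomic_exchange( m, 2 );                     /* mark contended      */
  while( c != 0 )
  { sys_futex( m, FUTEX_WAIT, 2 );                   /* sleep while *m == 2 */
    c = atomic_exchange( m, 2 );                     /* retry, stay "contended" */
  }
}

static void unlock( atomic_int *m )
{ if( atomic_exchange( m, 0 ) == 2 )                 /* there may be waiters */
    sys_futex( m, FUTEX_WAKE, 1 );
}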

Tangle answered 2/6, 2018 at 16:19 Comment(12)
Not an answer to the problems of the latest code. See discussion. – Wheelchair
To address the issue raised in the above comment ("If one thread enters mutex_lock (the intel_lock2.c arithmetic-using version) and the lock value is 0, it means the other thread owns the lock, and the lock value becomes -1; but there is a pending increment"): I am only focusing on getting critical sections for 1 producer and 1 consumer thread working here. – Wheelchair
@Peter Cordes: RE "thread A wakes up, futex returns while lck is still 0" - futex cannot do this, because the lck value was 0 at entry; futex(WAIT) only returns to the waiter when its monitored pointer has changed value and another thread does a futex(WAKE). – Wheelchair
@"Peter Cordes" : RE >o some other thread (B or a 3rd thread) enters mutex_lock and runs _atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST);, leaving lck == -1Wheelchair
@"Peter Cordes" : RE >o some other thread (B or a 3rd thread) enters mutex_lock and runs _atomic_sub_fetch( ua, 1, _ACQ_SEQ_CST);, leaving lck == -1 But there is no 3rd thread ! For lck to be 0, B in this case must be the thread allowed to continue to its critical section, which can only end with it invoking unlock() . So A comes in, finds the lck value has gone negative, and enters futex until B calls unlock and sets lck to 0 . There is no other thread to come in and mess up the maths here!Wheelchair
Somehow thread A's decrement is taking place and thread B's increment is not being seen , which would seem to me to violate the strong sequential ordering constraints and locking of the __atomic_sub_fetch(ua , x, __ATOMIC_SEQ_CST) or __atomic_add_fetch(ua, x, __ATOMIC_SEQ_CST) operations being used. It would be nice to be able to maintain a true waiter count in the lock value, since linux can't tell us how many waiters there are, but I don't see how given the lack of reliable atomic arithmetic operations on the Intel.Wheelchair
@JVD: I don't have time to look at this in detail. I think you're right that -1 is only possible if there's a third thread, though. But futex is an atomic compare-and-block, so it's probably best to keep the lck value at 0 while its locked, even with multiple waiters. Use lock cmpxchg. See my update to the last section.Tangle
Something about the intel_lock2.c code is telling the processor branch predictor that 'unlock' must PRECEDE 'lock', so that it is always trying to run any available unlock in preference to an lock; or something about the locking arithmetic operations is making it get a prediction fundamentally wrong. When I look at strace output of intel_lock3.c, which uses just constants without the locked arithmetic, I see the WAITs occur before the WAKEs . but the reverse in the case of intel_lock2.c .Wheelchair
OK, I ammended intel_lock2.c to keep a circular log in memory of lock states and TSC values, so we can see the sequence of operations that lead up to the deadlock:Wheelchair
T1: {st = LS_Lock, lv = 1, tsc = 101130538063127} T2: {st = LS_Lock, lv = 1, tsc = 101130538073306} T2: {st = LS_Unlock, lv = 1, tsc = 101130538073429} T2: {st = LS_Lock, lv = 1, tsc = 101130538073567} T1: {st = LS_Unlock, lv = 1, tsc = 101130538074118} T1: {st = LS_Lock, lv = 0, tsc = 101130538084471}Wheelchair
So what happens is threads A (T1) and B(T2) enter lock() with the lock (lv) unlocked (lck:1), then B gets the lock (lck:0), unlocks (lck:1), and enters lock() again (lck:0), but meanwhile A has been woken, which completes and unlocks and immediately enters lock() (lck:-1) again before B had woken . I think the lock value (lv) shown in this record : T2: {st = LS_Lock, lv = 1, tsc = 101130538073567} should have been 0, OR should have been 0 here : T1: {st = LS_Unlock, lv = 1, tsc = 101130538074118}. If the arithmetic was behaving as expected. one or the other should have happened.Wheelchair
@JVD: Are you still reading the lock value with a separate __atomic_load_n(ua,_LD_SEQ_);? That gives the other thread time to modify it again after the futex or lock xadd or whatever finishes. (Sorry if that isn't helpful, I haven't been following your code changes. I'd suggest writing down on paper (or in a text file like this) possible orderings of operations between the two threads. seq-cst means there's no reordering within a thread, but ops from separate threads can reorder. Interleave in different ways to look for an order that explains your dataTangle

The latest example intel_lock2.c program at

https://drive.google.com/open?id=10yLrohdKLZT4p3G1icFHdjF5eHY68Yws

now works as well as the latest intel_lock3.c program at

https://drive.google.com/open?id=1ElB0qmwcDMxy9NBYkSXVxljj5djITYxa

and there is now a version that keeps an accurate negative waiter count, and which uses locked arithmetic, at:

intel_lock4.c: https://drive.google.com/open?id=1kNOppMtobNHU0lfkfWTh8auXvRcbZfhO

The mutex_unlock() routine, IFF there are waiters, must wait for each existing waiter to unlock, so that when it returns, the mutex is unlocked and there are no waiters. It can achieve this either by spin-waiting with sched_yield() until the lock value becomes 1, or by using another futex call. So the original locker, when it enters mutex_unlock(), becomes responsible for ensuring that every existing waiter wakes up and unlocks the mutex.

Previously this answer contained:

But there is still weirdness: if either process is ptrace()-ed by strace, or is compiled with '-g3' instead of '-O3', it now experiences an 'Inconsistency' - i.e. inconsistent critical-section-modified values. This does not occur if the program is not ptrace-d and is compiled with -O3.

See discussion below. In order for GCC's builtin __atomic* functions to work, GCC's optimization phases must be invoked, with ANY -O$x flag specified during compilation sufficing to enable correct operation of the __atomic* builtins.

Final best version of the mutex_lock() / mutex_unlock() routines:

static inline __attribute__((always_inline))
bool mutex_lock( register _Atomic volatile int *ua )
// lock the mutex value pointed to by 'ua';
// can return false if operation was interrupted ( a signal received ).
{ register int x;
  // lock_again:
  x = __atomic_add_fetch( ua, -1, _ACQ_SEQ_);
  while( x < 0 )
  { switch(syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0))
    {case 0:
      break;
     case -1:
      switch( errno )
      { case EINTR:
         return false;
        case EAGAIN:
          break;
        default:
          // this has never been observed to happen, but in any 
          // production implementation
          // should be replaced by some kind of 
          // 'throw( exception )' statement:
          fprintf(stderr,"Unexpected futex WAIT error: %d : '%s'.",
                  errno, strerror(errno));
          return false;
       }
    }
    x = __atomic_load_n(ua,_LD_SEQ_);
  }
  return true;
}

static inline __attribute__((always_inline))
bool mutex_unlock( register _Atomic volatile int *ua )
// unlock: returns false only if interrupted, else returns true
// only when the mutex pointed to by *ua has been unlocked and 
// has no waiters.
{
#ifdef _WITH_UWAIT_
  static int has_unlock_waiter = 0;
#endif
  register int x;
  x = __atomic_add_fetch( ua, 1, _REL_SEQ_);
  if(x < 1) // there was at least ONE waiter, 
            // so we are the original locker
  { while(1 < syscall( SYS_futex, ua, FUTEX_WAKE, 1, nullptr,nullptr,0))
    { if( errno != 0 )
        switch(errno)
        {case EINTR:
          return false;
         case EAGAIN:
          break;
         default:
           // never observed to happen - should be a throw()
          fprintf(stderr,"Unexpected futex WAKE error: %d : '%s'.", 
                  errno, strerror(errno));
          return false;
        }
    }
#ifdef _WITH_UWAIT_
// this is strictly unnecessary, and can be replaced by use of
// sched_yield() (see below), but it
// makes the situation clearer:
// unlock :
    // so we have woken a waiter; wait for that waiter to 
    // actually unlock before returning -
    // by definition, when that waiter enters mutex_unlock() 
    // (AND IT MUST!!), it will not
    // enter the clause containing this code unless there is more than
    // one other waiter., in which case we want to continue until there
    // are no waiters.
    while(1 > (x = __atomic_load_n( ua, _LD_SEQ_ )))
    { __atomic_store_n(&has_unlock_waiter, 1, _ST_SEQ_);
      if( (-1 == 
          syscall( SYS_futex, ua, FUTEX_WAIT, x, nullptr,nullptr,0)
          ) && (errno == EINTR)
        ) return false;
    }
    if( __atomic_load_n(&has_unlock_waiter, _ST_SEQ_) )
      __atomic_store_n(&has_unlock_waiter, 0, _ST_SEQ_);
#else
// The same result is actually achieved by this loop:
    while(1 > (x = __atomic_load_n(ua, _LD_SEQ_)))
      sched_yield();
#endif
    // we do need to wait for the waiting locker to unlock 
    // before proceeding, else
    // mutex_lock could be reentered with lck < 0 and deadlock 
    // would result.
#ifdef _WITH_UWAIT_
  }else if( (x==1) && __atomic_load_n(&has_unlock_waiter, _ST_SEQ_) )
  { // so we're the waiter that a previous unlock woke up 
    // and is waiting for - it now needs to be woken:
    while(1 < syscall( SYS_futex, ua, FUTEX_WAKE, 1, nullptr,nullptr,0))
    { if( errno != 0 )
        switch(errno)
        {case EINTR:  // no, we cannot let user try to unlock again, since modification of lock value succeeded.
         case EAGAIN:
          break;
         default:
          fprintf(stderr,"Unexpected futex WAKE error: %d : '%s'.", errno, strerror(errno));
          return false;
        }
    }
  }
#else
  }
#endif
  return true;
}

Testing:

$ gcc -std=gnu11 -pthread -D_WITH_UWAIT_ -O3 -o il2 il2.c
$ ./il2
^C20906015
$ gcc -std=gnu11 -pthread -O3 -o il2 il2.c
$ ./il2
^C45851541

('^C' means pressing the Ctrl and C keys simultaneously).

Now all versions never deadlock and do work with:

$ strace -f -e trace=futex ./{intel_lock2 OR intel_lock3 OR intel_lock4} 

I was trying to strace a '-g' (only) compiled version and got an Inconsistency - this does not happen if ANY '-O' flag is also used.

Wheelchair answered 3/6, 2018 at 22:41 Comment(8)
If you need to call sched_yield after unlocking to avoid race conditions, your code isn't truly safe. It just makes it harder to find the bugs by almost always hiding the race, at least when the system is not heavily loaded. – Tangle
Absolutely, but it points to the cause of the issue: somehow the combination of the locked arithmetic and the 'mfence'-es inserted by the compiler around the atomic stores of the exit condition variable is causing the processor to believe that an unlock must always "happen before" a lock, so that unlock ends up hogging the CPU and lock doesn't get scheduled. But at least I'm seeing the same problem with both versions now. I'm more interested now in the fact that no version is mutually exclusive without being compiled with -O3; if compiled with -gX, mutual exclusion is not achieved. – Wheelchair
You can compile with -O3 -g. Adding debug metadata is not mutually exclusive with optimization. (You sometimes see "optimized out" when trying to examine local variables, or see a stale value for a global, when debugging optimized code, though, because debug info can't track variables when they're in registers.) – Tangle
Yes, sorry, to be more precise: in order for either example to achieve mutual exclusion (not exit with an "Inconsistency!" msg), it is necessary to compile with ANY '-O' flag, regardless of any '-g' flag also being present. I.e. compilation with just '-O' or '-O1' works with / without any '-g$X', but with any '-g$X' flag, a '-O${x}' flag MUST be specified for mutual exclusion to be achieved. After all, the gcc __atomic* builtins seem to affect primarily the optimization phases, and if none are run, I guess their special features are disabled. – Wheelchair
Is this an answer to the question or just an update on your investigation, or what? – Intervene
@JVD: No, __atomic built-ins and C++11 / C11 atomics still work correctly at the default -O0, because lock add is still necessary in un-optimized code. You keep mistrusting your tools and hardware, but it's almost certain you're just using them wrong. Compiling without optimization probably just opens up a race window wide enough for your testing to find it consistently. – Tangle
Yes, there was a bug - I was testing for rollover (var < last_var) OUTSIDE the critical section. In the -O or -O${x > 0} compiled versions, this test did not get triggered; in the -O0 compiled versions, it did. Sorry for the confusion. The fixed version, which works when compiled with or without any '-O$x' flag, is at: drive.google.com/open?id=1kNOppMtobNHU0lfkfWTh8auXvRcbZfhO – Wheelchair
@BeeOnRope: both. – Wheelchair
