x86 equivalent for LWARX and STWCX

Asked 18/7, 2009 at 16:8 Answered 10/8, 2010 at 13:34

Solved multithreading x86 low-level reference-counting

I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programing).

Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.

int* IncrementAndRetrieve(int **ptr)
{
  int val;
  int *pval;
  do
  {
    // fetch the pointer to the value
    pval = *ptr;

    // if its NULL, then just return NULL, the smart pointer
    // will then become NULL as well
    if(pval == NULL)
      return NULL;

    // Grab the reference count
    val = lwarx(pval);

    // make sure the pointer we grabbed the value from
    // is still the same one referred to by  'ptr'
    if(pval != *ptr)
      continue;

    // Increment the reference count via 'stwcx' if any other threads
    // have done anything that could potentially break then it should
    // fail and try again
  } while(!stwcx(pval, val + 1));
  return pval;
}

I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).

Thanks

Charlie answered 18/7, 2009 at 16:8 Comment(0)

As Michael mentioned, what you're probably looking for is the cmpxchg instruction.

It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).

If you need the stronger semantics on the x86 architecture, you can approximate it by using the x86s double-width compare-and-swap (DWCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is one of the two words contains the value of interest, and the other one contains an always incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter to wrap between attempts is so low that it's a reasonable substitute for most purposes.

Gunnar answered 20/7, 2009 at 12:45 Comment(9)

DCAS almost looks right, except i need to change 1 word only if a pointer to that word doesn't change while doing this (thats a little confusing, hopefully the update to the question helps clarify this). – Charlie 21/7, 2009 at 10:32

I managed to find a workaround using DCAS, its not foolproof, as it uses a unique ID (4 bytes in size) but the chances of it breaking are slim because both the 4 byte UID and the 4 byte counter adjacent to it must be replicated exactly. This is only a problem if something deletes the object reassigns the memory to something else and then manages to duplicate those 8 bytes while another thread is trying to copy a pointer, which is a relatively short operation (operation wise that is, length is only long enough if the thread is interrupted) – Charlie 9/8, 2009 at 12:20

I don't know about the PPC in particular, but on most machines, Load-Exclusive/Store-Conditional instructions don't really help with the ABA problem because memory operations performed between a load-exclusive and store-conditional may cause the store-conditional operation to spontaneously fail. If one rereads the guarded location and sees that it has changed, one can tell that something else wrote it with a new value, but if it holds the same value as on the previous read, there will be no way to distinguish a spontaneous failure from an ABA write. – Diplomat 10/9, 2014 at 15:35

When doing something like a linked-list insert, whose protocol would require reading an old pointer, storing it in a new list item, and then updating the old pointer to reference the new item, an outside ABA write could be a danger, but on some machines code which tries to LX the old pointer, store it to the new item, and SC the new pointer could loop endlessly even without any outside interference, if e.g. the old and new objects inhabit the same cache line, or inhabit cache lines which have certain address bits in common. Note that an LL/SC implementation could legitimately... – Diplomat 10/9, 2014 at 15:42

...have any store to a shared memory which takes place between an LX and an SC invalidate the latter [such an implementation, though simple, would suffice in many situations, especially in NUMA architectures where processors would keep most of their data in local memory, or in cases where there's only one main processor core, and peripheral devices may update the memory but won't generally flood it with a continuous stream of memory writes. – Diplomat 10/9, 2014 at 15:52

@Diplomat This is going back a fair while now (haven't worked on PPC in years) but IIRC, the lwarx instruction does correctly handle the ABA problem because it becomes invalid on any memory operation on the target cache line except for the matching stwcx call. It doesn't check the value of the address, but whether it has been accessed. Considering this is the only way to implement atomic operations on the PPC, it'd be silly if it wasn't ABA safe – Charlie 26/6, 2015 at 1:26

@GrantPeters: For an LX/SC sequence to be useful for anything it must always be possible for the SC to succeed in the absence of interference from other threads, and for two or more threads that are attempting simultaneous LX/SC to have at least one succeed. On many platforms, this imposes severe limits on what code can do between an LX and SC. If code needs to read a variable and store things derived from it into other variables before conditionally updating the original, it may be that the act of updating those other variables would cause the LX/SC to always fail (in a simplistic... – Diplomat 26/6, 2015 at 15:8

...implementation, any store could invalidate a pending LX/SC). An LX/SC restricted in such fashion could still be usable to synthesize operations like compare-and-swap, but CAS isn't ABA safe. Other implementations might be able to do more between an LX and SC, but I don't know of standard conventions for what would be guaranteed. For example, if one is trying to do a linked-list update and the new item happens to share a cache line with the old one, the sequence "LX (temp=old_item.link); new_item.link = temp; SC(old_item.link=new_item)" might never be able to succeed. – Diplomat 26/6, 2015 at 15:18

If one had a system where loads could never invalidate a pending LX, it might be possible to code fully-ABA-safe algorithms in such a way that no stores would be required between an LX and an SC, but I don't know that such algorithms could be made immune to live-lock without adding random delays. – Diplomat 26/6, 2015 at 15:23

x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).

Stonedeaf answered 18/7, 2009 at 16:30 Comment(0)

You're probably looking for the cmpxchg family of instructions.

You'll need to precede these with a lock instruction to get equivalent behaviour.

Have a look here for a quick overview of what's available.

You'll likely end up with something similar to this:

mov ecx,dword ptr [esp+4]
mov edx,dword ptr [esp+8]
mov eax,dword ptr [esp+12]
lock cmpxchg dword ptr [ecx],edx
ret 12

You should read this paper...

Edit

In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.

Libre answered 18/7, 2009 at 16:31 Comment(1)

Those 2 links are quite good (actually stumbled across those same 2 pages a few days ago), but unfortunately not what I'm looking for (I updated the question to better reflect this) – Charlie 21/7, 2009 at 10:26

if you are on 64 bits and limit yourself to say 1tb of heap, you can pack the counter into the 24 unused top bits. if you have word aligned pointers the bottom 5 bits are also available.

int* IncrementAndRetrieve(int **ptr)
{
  int val;
  int *unpacked;
  do
  {   
    val = *ptr;
    unpacked = unpack(val);

    if(unpacked == NULL)
      return NULL;
    // pointer is on the bottom
  } while(!cas(unpacked, val, val + 1));
  return unpacked;
}

Swoosh answered 26/9, 2009 at 17:18 Comment(1)

Memory doesn't have to be allocated at the lowest heap, so you can't be sure of this, unless you're specifying the addresses yourself (which i am), unfortunately, i'm not on a 64-bit platform, but this might be useful in the future. – Charlie 27/9, 2009 at 13:7

Don't know if LWARX and STWCX invalidate the whole cache line, CAS and DCAS do. Meaning that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer) you won't see much improvement if you are really pushing your software into stress. The best results I've seen so far were when people consciously casrificed 64b, planed their structures around it (packing stuff that won't be subject of contention), kept everything alligned on 64b boundaries, and used explicit read and write data barriers. Cache line invalidation can cost approx 20 to 100 cycles, making it a bigger real perf issue then just lock avoidance.

Also, you'd have to plan different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all it's memory bulk at the end) or datailed allocation management so that one structure under contention never receives memory realesed by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive but it's either that or paying the price for GC.

Postmeridian answered 10/8, 2010 at 13:34 Comment(1)

Yeah, this is kind of a non issue these days, in the end I've opted for more manual management and training the rest of the coders in the company how to properly do multi-threading via a couple of lock free structures that facilitate inter-thread communication. – Charlie 12/8, 2010 at 14:21

What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).

The reason that your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and stwcx without invalidating the stwcx.

Cyclostyle answered 25/7, 2009 at 15:30 Comment(2)

the "if(pval != ptr) continue;"is safe because whenever another thread changes a smart pointer, it will also alter the counter that its pointing to, hence, it will invalidate the stwcx as that value does get changed, and that is what is being monitored for change (just requires some careful structuring) – Charlie 27/7, 2009 at 11:21

You really need to post the other side too, then. I just tried to build an answer but there was too much guessing involved. Usually, these kinds of problems can be solved using CAS. – Cyclostyle 3/8, 2009 at 16:26

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags