Reading interlocked variables
Assume:

A. C++ under WIN32.

B. A properly aligned volatile integer incremented and decremented using InterlockedIncrement() and InterlockedDecrement().

__declspec (align(8)) volatile LONG _ServerState = 0;

If I want to simply read _ServerState, do I need to read the variable via an InterlockedXXX function?

For instance, I have seen code such as:

LONG x = InterlockedExchange(&_ServerState, _ServerState);

and

LONG x = InterlockedCompareExchange(&_ServerState, _ServerState, _ServerState);

The goal is to simply read the current value of _ServerState.

Can't I simply say:

if (_ServerState == some value)
{
// blah blah blah
}

There seems to be some confusion WRT this subject. I understand register-sized reads are atomic in Windows, so I would assume the InterlockedXXX function is unnecessary.

Matt J.


Okay, thanks for the responses. BTW, this is Visual C++ 2005 and 2008.

If it's true I should use an InterlockedXXX function to read the value of _ServerState, even if just for the sake of clarity, what's the best way to go about that?

LONG x = InterlockedExchange(&_ServerState, _ServerState);

This has the side effect of modifying the value, when all I really want to do is read it. Not only that, but there is a possibility that I could reset the flag to the wrong value if there is a context switch as the value of _ServerState is pushed on the stack in preparation of calling InterlockedExchange().

LONG x = InterlockedCompareExchange(&_ServerState, _ServerState, _ServerState);

I took this from an example I saw on MSDN.
See http://msdn.microsoft.com/en-us/library/ms686355(VS.85).aspx

All I need is something along the lines of:

lock mov eax, [_ServerState]

In any case, the point, which I thought was clear, is to provide thread-safe access to a flag without incurring the overhead of a critical section. I have seen LONGs used this way via the InterlockedXXX() family of functions, hence my question.

Okay, we are thinking a good solution to this problem of reading the current value is:

LONG Cur = InterlockedCompareExchange(&_ServerState, 0, 0);
Grampus answered 23/4, 2009 at 1:44 Comment(3)
This is a classic question where you ask whether what you're doing is OK instead of describing your real problem and asking how to deal with it. If you could describe what you're trying to achieve, someone may have an idea that you might not even have considered.Headrace
Seems that the OP is much more knowledgeable than the ones answering...Neuroglia
I have been wondering about this myself lots of times. The only time you could safely just read (or write, for that matter) without an atomic function is when you read a bool value, i.e. a flag. It's either false/zero or true/non-zero; the exact value of the bits involved is not that important. Or am I overlooking something?Dorolice

It depends on what you mean by "goal is to simply read the current value of _ServerState" and it depends on what set of tools and the platform you use (you specify Win32 and C++, but not which C++ compiler, and that may matter).

If you simply want to read the value such that the value is uncorrupted (ie., if some other processor is changing the value from 0x12345678 to 0x87654321 your read will get one of those 2 values and not 0x12344321) then simply reading will be OK as long as the variable is :

  • marked volatile,
  • properly aligned, and
  • read using a single instruction with a word size that the processor handles atomically

None of this is promised by the C/C++ standard, but Windows and MSVC do make these guarantees, and I think that most compilers that target Win32 do as well.

However, if you want your read to be synchronized with behavior of the other thread, there's some additional complexity. Say that you have a simple 'mailbox' protocol:

struct mailbox_struct {
    uint32_t flag;
    uint32_t data;
};
typedef struct mailbox_struct volatile mailbox;


// the global - initialized before either thread starts

mailbox mbox = { 0, 0 };

//***************************
// Thread A

while (mbox.flag == 0) { 
    /* spin... */ 
}

uint32_t data = mbox.data;

//***************************

//***************************
// Thread B

mbox.data = some_very_important_value;
mbox.flag = 1;

//***************************

The thinking is that Thread A will spin waiting for mbox.flag to indicate that mbox.data holds a valid piece of information. Thread B will write some data into mbox.data and then set mbox.flag to 1 as a signal that mbox.data is valid.

In this case a simple read in Thread A of mbox.flag might get the value 1 even though a subsequent read of mbox.data in Thread A does not get the value written by Thread B.

This is because even though the compiler will not reorder the Thread B writes to mbox.data and mbox.flag, the processor and/or cache might. C/C++ guarantees that the compiler will generate code such that Thread B will write to mbox.data before it writes to mbox.flag, but the processor and cache might have a different idea - special handling called 'memory barriers' or 'acquire and release semantics' must be used to ensure ordering below the level of the thread's stream of instructions.

I'm not sure if compilers other than MSVC make any claims about ordering below the instruction level. However MS does guarantee that for MSVC volatile is enough - MS specifies that volatile writes have release semantics and volatile reads have acquire semantics - though I'm not sure at which version of MSVC this applies - see http://msdn.microsoft.com/en-us/library/12a04hfd.aspx?ppud=4.

I have also seen code like you describe that uses Interlocked APIs to perform simple reads and writes to shared locations. My take on the matter is to use the Interlocked APIs. Lock free inter-thread communication is full of very difficult to understand and subtle pitfalls, and trying to take a shortcut on a critical bit of code that may end up with a very difficult to diagnose bug doesn't seem like a good idea to me. Also, using an Interlocked API screams to anyone maintaining the code, "this is data access that needs to be shared or synchronized with something else - tread carefully!".

Also when using the Interlocked API you're taking the specifics of the hardware and the compiler out of the picture - the platform makes sure all of that stuff is dealt with properly - no more wondering...

Read Herb Sutter's Effective Concurrency articles on DDJ (which happen to be down at the moment, for me at least) for good information on this topic.

Anathematize answered 23/4, 2009 at 4:41 Comment(4)
"the processor and/or cache might" -- Wrong. Results are always retired in-order. Instructions are run out of order if they're free of interdependencies, but results are ALWAYS written in the order they're expected.Neuroglia
@Zach: that may be true for x86 architectures; I'm not sure it's true for ia64. I'm also not sure if it's going to be true for future architectures (I hear that Win8 is supposed to be getting ARM support - I don't know what the memory model for multicore ARM is). Finally, note that Microsoft explicitly documents this: msdn.microsoft.com/en-us/library/ms686355.aspxAnathematize
@Michael: Were you not answering OP's question? A. C++ under Win32... Not WinARM, Win64, IA64, etc.Neuroglia
This is potentially wrong today. It's an old answer, but since it's the accepted one, I would appreciate a correction or comment regarding the current status. First, volatile has a different meaning now, and it might do nothing here. On other hardware architectures the picture might also be different. And there is no mention of memory barriers and "visibility": atomicity alone is not sufficient for all concurrency issues; they are still data races. (See also the Microsoft-specific /volatile:ms compiler option.)Diaphragm

Your way is good:

LONG Cur = InterlockedCompareExchange(&_ServerState, 0, 0);

I'm using similar solution:

LONG Cur = InterlockedExchangeAdd(&_ServerState, 0);
Nevus answered 14/11, 2013 at 15:54 Comment(1)
Thanks for providing another nice solution.Barhorst

Interlocked instructions provide atomicity and inter-processor synchronization. Both writes and reads must be synchronized, so yes, you should be using interlocked instructions to read a value that is shared between threads and not protected by a lock. Lock-free programming (and that's what you're doing) is a very tricky area, so you might consider using locks instead. Unless this is really one of your program's bottlenecks that must be optimized?

Laurinda answered 23/4, 2009 at 3:19 Comment(7)
So the whole cache coherency logic can be thrown out in the next CPU iteration since obviously no one seems to care to utilize it... what a waste of silicon real estate!Neuroglia
Indeed, a lot of cache coherency logic has been thrown out in modern multicores (it's just too slow) and replaced by relaxed coherency models. Processor cores don't have the same view of memory unless explicitly requested using locked instructions or memory barriers. That includes the good old x86.Laurinda
Wrong. You're getting confused with IA64 (a modern but dying architecture). x86 and x64 have been and will continue to implement strict memory model.Neuroglia
The fact that the x86 has mfence instruction is a good indicator that it has a relaxed memory model. In a strict (sequentially consistent) model no fences are necessary. To be exact, the x86 implements the TSO model (Total Store Order). I described it in my blog: blog.corensic.com/2011/08/15/data-races-at-the-processor-level .Laurinda
None of what you've written in that blog proves cache coherency logic has been thrown out! If your result hasn't even made it back to the cache (still in your conceptual store FIFO buffer), then cache snooping wouldn't make any difference now would it? Also, mfence is a guard to make sure all stores (to cache) finishes (basically a wait). Strict memory model doesn't mean one should ignore the effects of pipelining / clocks to store results back to cache.Neuroglia
Maybe I'm misunderstanding what you mean by "strict memory model." I interpreted it as sequentially consistent memory model.Laurinda
Maybe we are misunderstanding each other. Architecturally speaking, the x86 indeed implements the MESI coherency protocol in its cache. However, the writes don't go directly to the cache so the effect is a relaxed memory model.Laurinda

To anyone who has to revisit this thread: I want to add to what was well explained by Bartosz that _InterlockedCompareExchange() is a good alternative to the standard atomic_load() if standard atomics are not available. Here is the code for atomically reading my_uint32_t_var in C on x64 Windows; atomic_load() is included as a benchmark:

 long debug_x64_i = std::atomic_load((const std::_Atomic_long *)&my_uint32_t_var);
00000001401A6955  mov         eax,dword ptr [rbp+30h] 
00000001401A6958  xor         edi,edi 
00000001401A695A  mov         dword ptr [rbp-0Ch],eax 
    debug_x64_i = _InterlockedCompareExchange((long*)&my_uint32_t_var, 0, 0);
00000001401A695D  xor         eax,eax 
00000001401A695F  lock cmpxchg dword ptr [rbp+30h],edi 
00000001401A6964  mov         dword ptr [rbp-0Ch],eax 
    debug_x64_i = _InterlockedOr((long*)&my_uint32_t_var, 0);
00000001401A6967  prefetchw   [rbp+30h] 
00000001401A696B  mov         eax,dword ptr [rbp+30h] 
00000001401A696E  xchg        ax,ax 
00000001401A6970  mov         ecx,eax 
00000001401A6972  lock cmpxchg dword ptr [rbp+30h],ecx 
00000001401A6977  jne         foo+30h (01401A6970h) 
00000001401A6979  mov         dword ptr [rbp-0Ch],eax 

    long release_x64_i = std::atomic_load((const std::_Atomic_long *)&my_uint32_t_var);
00000001401A6955  mov         eax,dword ptr [rbp+30h] 
    release_x64_i = _InterlockedCompareExchange((long*)&my_uint32_t_var, 0, 0);
00000001401A6958  mov         dword ptr [rbp-0Ch],eax 
00000001401A695B  xor         edi,edi 
00000001401A695D  mov         eax,dword ptr [rbp-0Ch] 
00000001401A6960  xor         eax,eax 
00000001401A6962  lock cmpxchg dword ptr [rbp+30h],edi 
00000001401A6967  mov         dword ptr [rbp-0Ch],eax 
    release_x64_i = _InterlockedOr((long*)&my_uint32_t_var, 0);
00000001401A696A  prefetchw   [rbp+30h] 
00000001401A696E  mov         eax,dword ptr [rbp+30h] 
00000001401A6971  mov         ecx,eax 
00000001401A6973  lock cmpxchg dword ptr [rbp+30h],ecx 
00000001401A6978  jne         foo+31h (01401A6971h) 
00000001401A697A  mov         dword ptr [rbp-0Ch],eax
Tambour answered 8/8, 2016 at 14:33 Comment(0)

32-bit read operations are already atomic on some 32-bit systems (the Intel spec says these operations are atomic, but there's no guarantee that this will be true on other x86-compatible platforms), so you shouldn't rely on this for thread synchronization.

If you need some sort of flag, you should consider using an Event object and the WaitForSingleObject function for that purpose.

Grandmotherly answered 27/7, 2009 at 4:36 Comment(4)
Wait, you just said 32-bit read ops are atomic, then why would you ask someone to consider using Event / WaitForSingleObject???Neuroglia
Other x86-compatible platforms -- You mean AMD?Neuroglia
The point is that there's no standard of any sort for that. MSDN doesn't say there's any such guarantee. I'm not aware of AMD's position. But note that there are many more x86 systems than just AMD and Intel.Grandmotherly
@ZachSaw let us continue this discussion in chatGrandmotherly

You should be okay. It's volatile, so the optimizer shouldn't savage you, and it's a 32-bit value, so it should be at least approximately atomic. The one possible surprise is whether the instruction pipeline can get around that.

On the other hand, what's the additional cost of using the guarded routines?

Magalimagallanes answered 23/4, 2009 at 1:50 Comment(2)
OMG another clown who's never been savaged by code in the pipeline.Magalimagallanes
You might want to look up the word "chauvinistic" son. It doesn't mean what you think it does.Magalimagallanes

Reading the current value may not need any lock.

Termless answered 23/4, 2009 at 1:51 Comment(2)
As long as it's no wider than the width of the word, and the variable is marked as volatile, it won't. If you use a non-volatile variable and write to it via Interlocked, you can't just read it, since you may see the cached value.Emlin
@romkyns: The value can be wider than a word if it's in cacheable memory, fits within a single cache line (which will be true if it's naturally aligned), and is accessed via a single instruction. To be clear, in the non-volatile case, the value you see is not coming from a processor cache, but from a register that the compiler is free to load at almost any time.Glenn

The Interlocked* functions prevent two different processors from simultaneously accessing the same piece of memory. On a single-processor system you are going to be OK. If you have a dual-core system where threads on different cores both access this value, you might have problems doing what you think is atomic without the Interlocked*.

Siderostat answered 23/4, 2009 at 1:53 Comment(0)

Reading is fine. A 32-bit value is always read as a whole as long as it isn't split across a cache line. Your align 8 guarantees that it's always within a cache line, so you'll be fine.

Forget about instruction reordering and all that nonsense. Results are always retired in order. It would be a processor recall otherwise!!!

Even for a dual CPU machine (i.e. shared via the slowest FSBs), you'll still be fine as the CPUs guarantee cache coherency via MESI Protocol. The only thing you're not guaranteed is the value you read may not be the absolute latest. BUT, what is the latest anyway? That's something you likely won't need to know in most situations if you're not writing back to the location based on the value of that read. Otherwise, you'd have used interlocked ops to handle it in the first place.

In short, you gain nothing by using Interlocked ops on a read (except perhaps reminding the next person maintaining your code to tread carefully - then again, that person may not be qualified to maintain your code to begin with).

EDIT: In response to a comment left by Adrian McCarthy.

You're overlooking the effect of compiler optimizations. If the compiler thinks it has the value already in a register, then it's going to re-use that value instead of re-reading it from memory. Also, the compiler may do instruction re-ordering for optimization if it believes there are no observable side effects.

I did not say that reading from a non-volatile variable is fine. All the question asked was whether interlocked was required. In fact, the variable in question was clearly declared volatile. Or were you overlooking the effect of the keyword volatile?

Neuroglia answered 7/7, 2011 at 1:36 Comment(4)
This is not quite correct. On x86 processors, reads are retired in order and writes are retired in order, but a write may be delayed after a subsequent read. Allowing this reordering is pretty much required to make a store buffer valuable.Highspirited
On a uArch level in the Northwood/Prescott CPU for example, there's a result forwarding mechanism in the uOp retiring unit that allows immediate consumption of result to the front-end, if its result is indeed what the front end is waiting for. No reordering can take place when it threatens the validity of the results (i.e. register/var dependencies). Worse, in the OP's case, there's no read following a write.Neuroglia
You're overlooking the effect of compiler optimizations. If the compiler thinks it has the value already in a register, then it's going to re-use that value instead of re-reading it from memory. Also, the compiler may do instruction re-ordering for optimization if it believes there are no observable side effects.Charlenecharleroi
If volatile worked the way you think it does, barriers would not be necessary. Coherency and explicit instruction ordering (particularly when code is executing on separate processing units) are two different things. The interlocked functions have variants with ordering semantics for this very reason.Heshum

Your initial understanding is basically correct. According to the memory model which Windows requires on all MP platforms it supports (or ever will support), reads from a naturally-aligned variable marked volatile are atomic as long as they are no larger than a machine word. Same with writes. You don't need a 'lock' prefix.

If you do the reads without using an interlock, you are subject to processor reordering. This can even occur on x86, in a limited circumstance: reads from a variable may be moved above writes of a different variable. On pretty much every non-x86 architecture that Windows supports, you are subject to even more complicated reordering if you don't use explicit interlocks.

There's also a requirement that if you're using a compare exchange loop, you must mark the variable you're compare exchanging on as volatile. Here's a code example to demonstrate why:

long g_var = 0;  // not marked 'volatile' -- this is an error

bool foo () {
    long oldValue;
    long newValue;
    long retValue;

    // (1) Capture the original global value
    oldValue = g_var;

    // (2) Compute a new value based on the old value
    newValue = SomeTransformation(oldValue);

    // (3) Store the new value if the global value is equal to old?
    retValue = InterlockedCompareExchange(&g_var,
                                          newValue,
                                          oldValue);

    if (retValue == oldValue) {
        return true;
    }

    return false;
}

What can go wrong is that the compiler is well within its rights to re-fetch oldValue from g_var at any time if it's not volatile. This 'rematerialization' optimization is great in many cases because it can avoid spilling registers to the stack when register pressure is high.

Thus, step (3) of the function would become:

// (3) Incorrectly store new value regardless of whether the global
//     is equal to old.
retValue = InterlockedCompareExchange(&g_var,
                                      newValue,
                                      g_var);
Highspirited answered 10/5, 2012 at 7:46 Comment(3)
What's the need to prevent processor reordering when you're simply reading? volatile int a = 10; (a == 10) would always be true! In multi CPU, you'd put a memory barrier after volatile int a = 10; but if you then read after the memory barrier, var 'a' would still turn out to be 10! Why would you need interlocked read?? *we're discussing accesses within a single cache line.Neuroglia
Consider implementing Peterson's Algorithm. That requires read ordering and atomicity.Highspirited
That algo is not what the OP is asking!Neuroglia
