Force order of execution of C statements?

I have a problem with the MS C compiler reordering, at high levels of optimization, certain statements that are critical in a multithreading context. I want to know how to force ordering in specific places while still using high levels of optimization. (At low levels of optimization, this compiler does not reorder statements.)

The following code:

 ChunkT* plog2sizeChunk=...
 SET_BUSY(plog2sizeChunk->pPoolAndBusyFlag); // set "busy" bit on this chunk of storage
 x = plog2sizeChunk->pNext;

produces this:

 0040130F 8B 5A 08             mov         ebx,dword ptr [edx+8]
 00401312 83 22 FE             and         dword ptr [edx],0FFFFFFFEh

in which the write to pPoolAndBusyFlag is reordered by the compiler to occur after the pNext fetch.

SET_BUSY is essentially

  plog2sizeChunk->pPoolAndBusyFlag &= 0xFFFFFFFE;
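
i.e., as a macro it is presumably spelled roughly like this (illustrative only; bit 0 is the busy flag, the rest of the word holds the pool pointer):

 #define SET_BUSY(pPoolAndBusyFlag) ((pPoolAndBusyFlag) &= 0xFFFFFFFE)  // clear bit 0 to mark the chunk busy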

I think the compiler has rightfully decided it was OK to reorder these accesses because they are to two separate members of the same struct, and such reordering has no effect on the results of single-threaded execution:

typedef struct chunk_tag {
    unsigned pPoolAndBusyFlag;          // Contains pointer to owning pool and a busy flag
    natural log2size;                   // holds log2size of the chunk if Busy==false
    struct chunk_tag* pNext;            // holds pointer to next block of same size
    struct chunk_tag* pPrev;            // holds pointer to previous block of same size
} ChunkT, *pChunkT;

For my purposes, pPoolAndBusyFlag has to be set before other accesses to this structure are valid in a multithreaded/multicore context. I don't think this particular access is problematic for me, but the fact that the compiler can reorder it means that other parts of my code may suffer the same kind of reordering, and there it may be critical. (Imagine the two statements are updates to the two members rather than one write and one read.) I want to be able to force the order of these actions.

Ideally, I'd write something like:

 plog2sizeChunk->pPoolAndBusyFlag &= 0xFFFFFFFE;
 #pragma no-reordering // no such directive appears to exist
 pNext = plog2sizeChunk->pNext;

I have experimentally verified I can get this effect in this ugly way:

 plog2sizeChunk->pPoolAndBusyFlag &= 0xFFFFFFFE;
 asm {  xor eax, eax }  // compiler won't optimize past asm block
 pNext = plog2sizeChunk->pNext;

gives

 0040130F 83 22 FE             and         dword ptr [edx],0FFFFFFFEh  
 00401312 33 C0                xor         eax,eax  
 00401314 8B 5A 08             mov         ebx,dword ptr [edx+8]  

I note that the x86 hardware may reorder these particular instructions anyway, since they don't refer to the same memory location and reads may pass writes; to really fix this example, I'd need some type of memory barrier. Back to my earlier remark: if they were both writes, the x86 will not reorder them, and other threads will see the writes in that order. So in that case I don't think I need a memory barrier, just a forced ordering.

I have not seen the compiler re-order two writes (yet) but I haven't been looking very hard (yet); I just tripped over this. And of course with optimizations just because you don't see it in this compilation doesn't mean it won't appear in the next.

So, how do I force the compiler to order these?

I understand I can declare the memory slots in the struct to be volatile. They are still independent storage locations, so I don't see how this prevents the optimization. Maybe I'm misinterpreting what volatile means?
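
Concretely, that would mean declaring the struct something like this (a sketch; only the two members involved in the critical writes marked volatile):

 typedef struct chunk_tag {
     volatile unsigned pPoolAndBusyFlag;  // busy flag + pool pointer; write ordering matters
     volatile natural log2size;           // written while the block is being tracked
     struct chunk_tag* pNext;             // holds pointer to next block of same size
     struct chunk_tag* pPrev;             // holds pointer to previous block of same size
 } ChunkT, *pChunkT;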

EDIT (Oct 20): Thanks to all the responders. My current implementation uses volatile (the initial solution), _ReadWriteBarrier (to mark the places where the compiler must not reorder), and a few MemoryBarriers (where reads and writes occur), and that seems to have solved the problem.
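
In the hot spot from the original snippet, that works out to roughly this (a sketch; _ReadWriteBarrier comes from <intrin.h>, MemoryBarrier from windows.h):

 #include <intrin.h>
 #include <windows.h>

 SET_BUSY(plog2sizeChunk->pPoolAndBusyFlag);  // set "busy" bit on this chunk of storage
 _ReadWriteBarrier();                         // compiler: do not move loads/stores across this point
 MemoryBarrier();                             // CPU: do not let the following read pass the write
 x = plog2sizeChunk->pNext;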

EDIT (Nov 2): To keep things clean, I ended up defining sets of macros for ReadBarrier, WriteBarrier, and ReadWriteBarrier. There are sets for pre- and post-locking, pre- and post-unlocking, and general usage. Some of these are empty, some contain _ReadWriteBarrier and MemoryBarrier, as appropriate for x86 and typical spin locks based on XCHG (XCHG includes an implicit MemoryBarrier, which obviates that need in the lock pre-/post- sets). I then parked these in the code at appropriate places, documenting the essential (non)reordering requirements.
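
As a rough illustration (the names and the exact split are illustrative, not my real macro set):

 #include <intrin.h>
 #include <windows.h>

 // General-purpose barriers
 #define COMPILER_BARRIER()    _ReadWriteBarrier()  // stops compiler reordering only
 #define HARDWARE_BARRIER()    MemoryBarrier()      // stops CPU reordering

 // Lock acquire built on XCHG: the XCHG already acts as a full fence,
 // so the pre-/post-lock barriers can be empty on x86.
 #define PRE_LOCK_BARRIER()
 #define POST_LOCK_BARRIER()

 // Unlock is a plain store; keep the compiler from sinking earlier writes past it.
 #define PRE_UNLOCK_BARRIER()  COMPILER_BARRIER()
 #define POST_UNLOCK_BARRIER()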

Firdausi answered 10/10, 2013 at 9:32 Comment(12)
+1 well researched question.Markham
Um, you have no idea how painfully researched this was :-{Firdausi
Is SET_BUSY anything more than &= or is this code protected by a mutex or otherwise made thread-safe? Because if you are trying to synchronize threads using what basically amounts to if (!busy) { busy = true; DoStuff(); busy = false; } then that is not going to be thread-safe no matter the order.Korenblat
There are mutexes protecting this. Yes, SET_BUSY here is just &=. The particular memory block in this example isn't known to any other threads so this is safe. Blocks transit from "unknown" to BUSY to known. BUSY blocks can't be modified by any other than their specific thread owners. A block can't transit from BUSY to un-busy without a lock for all blocks in its class being acquired by the owner. Holding the lock, the owner fills it with information useful for tracking it while it is unbusy, and then marks the block unbusy. ...Firdausi
... what matters here for my question is that the tracking information MUST be stored in the block (1 write) before it is marked as unbusy (2nd write). If the compiler reorders those writes, the block gets marked as unbusy and the tracking information is wrong. Sure death.Firdausi
Hmm. The C++ standard seems to say that reads and writes to volatile variables cannot be moved past one another even if they aren't to the same location. Maybe that's the cure, assuming it applies to C.Firdausi
... marked pPoolAndBusyFlag and log2size as volatile, and voila, the compiler appears to no longer change the order of accesses, which is what I wanted. This may not actually solve my problem, but at least there are no longer surprising out-of-order critical writes.Firdausi
About your edit: If you use volatile and are purely interested in MSVC, that should take care of all the necessary memory barriers by itself.Snitch
For the moment, purely interested in MSVC (someday I'll care about GCC and maybe something other than x86 but life is hard enough with this for the moment). Volatile prevents the compiler from re-ordering reads and writes, but doesn't prevent the CPU from doing it. I still need memory barriers in critical places.Firdausi
That's true for standard volatiles, but MSVC gives stronger guarantees for volatiles if you use the volatile:ms flag (which happens to be the default for everything but ARM), hence my question. With that flag set volatiles behave just like they do on say the JMM or in .NET. I.e. writes have release and reads acquire semantics.Snitch
"volatile:ms"? This is a new bit of information. Where is this flag placed? Pragra? Command line? default?Firdausi
It's a command-line flag - I don't think there's a pragma to turn it on only for some code, but since it just gives stronger guarantees, any code that works without it will work fine with it. See here for the documentation. The MS volatile is succinct and well understood (for more information you probably want to look at material about Java volatiles - there's much more information around and it's equivalent to this extension), so it's a great solution imo if you don't mind non-portability (C only; std::atomic is the way for C++).Snitch

So as I understand it, pNext = plog2sizeChunk->pNext publishes the block so that it can be seen by other threads, and you have to make sure they see the correct busy flag.

That means you need a uni-directional memory barrier before publishing it (also one before reading it in another thread, although if your code runs on x86 you get those for free) to make sure that threads actually see the change. You also need one before the write to avoid reordering writes after it. No, just inserting assembly or using a standard-compliant volatile (MSVC volatile gives extra guarantees, though, that make a difference here) is not enough - yes, this stops the compiler from shifting reads and writes around, but the CPU is not bound by it and can do the same reordering internally.

Both MSVC and gcc have intrinsics/macros to create memory barriers (see e.g. here). MSVC also gives stronger guarantees for volatiles that are good enough for your problem. Finally, C++11 atomics would work as well, but I'm not sure if C itself has any portable way to guarantee memory barriers.

Snitch answered 10/10, 2013 at 13:21 Comment(4)
Why do I need a memory barrier to prevent reordering of writes? The x86 architecture ensures that writes are done in program order. Agreed, where the code mixes reads and writes I need a memory barrier to prevent reads from passing critical writes.Firdausi
@Ira Yes, x86 has very strong ordering requirements, which means that many memory barriers are a no-op for that ISA, true. I just think it's better to write correct code for all architectures and let the compiler writers worry about which barriers are implicit for a given architecture and which aren't.Snitch
Time to award for answer. Thanks for your help.Firdausi
For the record, C11 has <stdatomic.h> (en.cppreference.com/w/c/atomic) which provides essentially equivalent functionality to C++11 <atomic>. This lets you write portable code with acquire/release (or seq_cst) semantics, letting the compiler use barrier instructions or not as necessary for the target ISA. Or on ARM, use acquire-load and/or release-store instructions for much better efficiency than a plain load or store + a separate barrier intrinsic. (If you want non-atomic RMW like the and dword ptr [edx],0FFFFFFFEh in the Q, you do need separate atomic load and store)Crotchety
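
For illustration, a minimal sketch of that C11 approach (names and layout are illustrative, not the question's actual structure; the release store publishes earlier writes, and the acquire load makes them visible before the reader touches the rest of the chunk):

 #include <stdatomic.h>

 typedef struct chunk {
     _Atomic unsigned pPoolAndBusyFlag;   /* bit 0 clear == busy, as in the question */
     struct chunk *pNext;
 } Chunk;

 /* Owner: fill in the tracking info first, then mark the block un-busy.
    The release ordering guarantees the pNext write is visible before the flag change. */
 void publish_chunk(Chunk *c, Chunk *next)
 {
     c->pNext = next;                                   /* ordinary write */
     atomic_fetch_or_explicit(&c->pPoolAndBusyFlag, 1u, memory_order_release);
 }

 /* Another thread: acquire-load the flag before trusting the rest of the chunk. */
 int chunk_is_free(Chunk *c)
 {
     return atomic_load_explicit(&c->pPoolAndBusyFlag, memory_order_acquire) & 1u;
 }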

See _ReadWriteBarrier. That's a compiler intrinsic dedicated to what you are looking for. Be sure to check the documentation against your precise version of MSVC ("deprecated" on VS2012...). Beware CPU reordering (then see MemoryBarrier).
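
In the question's snippet, the usage would be roughly (a sketch; <intrin.h> assumed):

 #include <intrin.h>

 plog2sizeChunk->pPoolAndBusyFlag &= 0xFFFFFFFE;
 _ReadWriteBarrier();   // the compiler may not move loads or stores across this point
 x = plog2sizeChunk->pNext;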

The documentation states that the _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics (compiler reordering) and the MemoryBarrier macro (CPU reordering) are all "deprecated" starting with VS2012. But I think they will continue to work fine for some time...

New code may use the new C++11 facilities (links in the MSDN page).

Affranchise answered 10/10, 2013 at 13:23 Comment(7)
_WriteBarrier is useless because: The _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics prevent compiler re-ordering only.Snitch
I know that, see my mention of cpu reordering? I just answered the OP's question...Affranchise
Yes but: That's a compiler intrinsic dedicated to what you are looking for is patently wrong, because it does not solve the OPs problems, so why mention _WriteBarrier at all then?Snitch
I see your point, believe me. But OP's question" was "So, how do I force the compiler to order these?"Affranchise
+1 This was very useful. In particular, it let me annotate the code in places where such reordering shouldn't happen. I'm not pleased to hear that it is "deprecated" in VS 2012; what do they intend to replace it with?Firdausi
Had to award the answer to somebody, and both yours and @Voo's were just right. Hard choice, awarded to Voo. But I thank for the answer.Firdausi
@IraBaxter No problems. Happy coding!Affranchise

I would use the volatile keyword. It will prevent the compiler from re-ordering the instructions. http://www.barrgroup.com/Embedded-Systems/How-To/C-Volatile-Keyword

Flattery answered 10/10, 2013 at 13:27 Comment(1)
+1 Thanks, it did help. I found @manuell's answer of the ReadWriteBarrier to be very useful, and less of a sledgehammer.Firdausi
