How 'undefined' can a race condition be?
Let's say I define the following C++ object:

class AClass
{
public:
    AClass() : foo(0) {}
    uint32_t getFoo() { return foo; }
    void changeFoo() { foo = 5; }
private:
    uint32_t foo;
} aObject;

The object is shared by two threads, T1 and T2. T1 is constantly calling getFoo() in a loop to obtain a number (which will always be 0 if changeFoo() was not called before). At some point, T2 calls changeFoo() to change it (without any thread synchronization).

Is there any practical chance that the values ever obtained by T1 will be different than 0 or 5 with modern computer architectures and compilers? All the assembler code I investigated so far was using 32-bit memory reads and writes, which seems to preserve the integrity of the operation.

What about other primitive types?

Practical means that you can give an example of an existing architecture or a standard-compliant compiler where this (or a similar situation with different code) is theoretically possible. I leave the word modern a bit subjective.


Edit: I can see many people noticing that I should not expect 5 to be read ever. That is perfectly fine to me and I did not say I do (though thanks for pointing this aspect out). My question was more about what kind of data integrity violation can happen with the above code.
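For reference, a sketch of the data-race-free variant that several comments below suggest, using C++11 std::atomic (the question notes C++11 may be out of reach, so this is only for comparison):

```cpp
#include <atomic>
#include <cstdint>

// Same interface as the question's class, but foo is now a std::atomic,
// so concurrent unsynchronized reads and writes are well-defined.
class AClass
{
public:
    AClass() : foo(0) {}
    uint32_t getFoo() { return foo.load(); }   // atomic, data-race-free read
    void changeFoo() { foo.store(5); }         // atomic write, guaranteed visible
private:
    std::atomic<uint32_t> foo;
} aObject;
```

With this version, T1 is guaranteed to read only 0 or 5, and to see the 5 within a reasonable time after T2 stores it.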

Inheritance answered 22/2, 2013 at 8:51 Comment(10)
The last paragraph makes little sense. A practical example of something where a certain outcome is theoretically possible? I suggest you remove one of those two words. :)Mortification
It is theoretically possible that it will happen on every existing architecture with every standard-compliant compiler, because a standard-compliant compiler can choose to do whatever it likes when it encounters UB. In practice, compilers tend to be forgiving about UB, but it is certainly theoretically possible that they do all sorts of other weird things.Mortification
@jalf It doesn't really matter here, since the code won't work as expected (that the reading thread will eventually see 5) on any architecture I know (certainly not with VC++ under Windows, g++ under Linux, or Sun CC under Solaris).Someway
Simply add volatile to your variable to avoid optimization surprises: volatile uint32_t foo;Interact
@Interact volatile helps compiler optimizations, but doesn't do anything about the read and write pipelines on modern processors. (Arguably, it should, but in reality, it doesn't.)Someway
James Kanze: volatile doesn't help compiler optimization, it hinders it. It directs the compiler that reads and writes cannot be folded / optimized away. The primary purpose is for accessing (memory-mapped) hardware registers.Breaux
Andrew, your question doesn't make a lot of sense. It's perfectly well defined what object code a particular version of a particular compiler for a particular architecture will produce for any given source code. It's just not defined by the language standard. Could your example ever produce a value other than 0 or 5? probably not with existing compilers on eg x86 architecture. But on some other architecture where concurrent access must be regulated, it certainly could. I know of no such architecture.Breaux
@davmac: Good! The existence of such compilers/architectures is actually what I want to know. And perhaps some reasonable comments on practical possibility of their existence in future. Since you say you don't know of such an architecture and since nobody mentioned other examples than 16-bit architectures (which are old enough for me) with every hour it makes me more convinced it's practical for me to use a code like above when C++11 features are out of reach at my company (unfortunately). I think I can take the risk.Inheritance
The reason you're getting so much push back on your question is that there's no good reason to ask this but LOTS of bad reasons to ask it. Nothing about the answer to this question should be used for programming.Heid
When you write a program in C++, you're relying on a contract between yourself and the compiler and language-spec authors: You agree to follow the rules of the language, and in return they agree to convert the source code you wrote into an executable that does what you wrote it to do. It's this contract that gives your program meaning; if you break it, then what are you relying on to give you confidence that your program will work? You can do all the empirical testing you like, but none of that will help you if a new compiler or CPU comes out tomorrow that breaks your bad assumptions.Prediction
In practice, all mainstream 32-bit architectures perform 32-bit reads and writes atomically. You'll never see anything other than 0 or 5.

Mahlstick answered 22/2, 2013 at 8:57 Comment(13)
That is wrong. The compiler could change your program so that the changes will never be visible to other threads.Pastose
@bamboon: I made two claims. Which of them is false?Mahlstick
Sorry for being imprecise, I meant the latter part. Two threads accessing a variable where one is a write and another is a read results in a race condition which is UB per the standard.Pastose
@bamboon: It's not a race condition if the load and store are atomic, which is exactly what I said in my response.Purify
+1 for a practical answer, not based on any religious beliefs. If you store/read a variable in an atomic operation, there is no race condition and no undefined behaviour.Interact
@MarceloCantos You don't say what you mean by "atomically". According to the definition used in the standard, most 32 bit architectures do not perform 32 bit reads and writes atomically.Someway
@JamesKanze: I'm not sure why this phrase is still coming up: "According to the definition used in the standard". The standard does not perfectly define every scenario. This was explicitly a practical question, so the standard + reality should be considered, not one in absence of the other.Purify
@EdS. There are two widely used definitions of "atomic" with regards to memory accesses. The standard uses one (the one most used in modern use). No one is claiming that the access will result in any sort of slicing, with another thread seeing some of the bits modified, and not others (especially as he only modifies the bits of a single byte). But that's only part of the problem on a modern processor. A store access from one thread may never be seen in another thread unless that access is followed by a membar or some sort of fence instruction.Someway
@JamesKanze: Since the OP asked if something other than 0 or 5 might pop out, I'd say they were more concerned with slicing than fences. I agree, though, that the fencing issue needs to be considered.Mahlstick
Related: Which types on a 64-bit computer are naturally atomic in gnu C and gnu C++? -- meaning they have atomic reads, and atomic writes (none, and an answer shows tearing is possible in practice on AArch64 GCC for an assignment of some constants to uint64_t). I think something similar is unlikely to happen on any real 32-bit machine, though. And definitely not tearing between 0 and 5 except on a DeathStation 9000 that intentionally does something like storing 4 and then incrementing.Caveat
0 and 5 differ only in their low byte, in all 3 signed integer object-representations allowed by the standard (two's complement, one's complement, and sign/magnitude). So even tearing wouldn't produce a different value. Oh, and the question uses uint32_t which is required to be 32-bit with no padding, so you couldn't even have a case where low padding caused the value-bits of 5 to be split across a byte (char) boundary for tearing on a hypothetical machine that loaded each byte separately, perhaps an 8-bit CPU or a DeathStation 9000 on a 16 or 32-bit CPU.Caveat
@JamesKanze: Atomicity (all-or-nothing / lack of tearing) is just one of the things std::atomic gives you. It also gives you well-defined inter-thread visibility, and (optionally) visibility-ordering wrt. operations on other objects. If you look at ACID (en.wikipedia.org/wiki/ACID) in database terminology, they just talk about atomicity as being all-or-nothing. The guaranteed-visible part is separate. I think it's reasonable (in terms of failure-mode analysis for programs with UB) to talk about accesses that would be atomic if not optimized away...Caveat
Of course, more bad things can happen with non-volatile non-atomic int or uint32_t accesses than just tearing: lwn.net/Articles/793253 points out that even if an access isn't optimized away, a compiler is allowed to turn int tmp = shared; into multiple reads of shared which could give different values!! It can invent reads in a way that makes your local variable have two different values.Caveat
In practice, you will not see anything other than 0 or 5 as far as I know (except maybe on some weird 16-bit architecture with a 32-bit int, where this might not be the case).

However whether you actually see 5 at all is not guaranteed.

Suppose I am the compiler.

I see:

while (aObject.getFoo() == 0) {
    printf("Sleeping");
    sleep(1);
}

I know that:

  • aObject.foo is equal to 0 (from some prior code)
  • printf cannot change aObject
  • sleep cannot change aObject
  • getFoo does not change aObject (thanks to the inline definition)

And therefore I can safely transform the code:

while (true) {
    printf("Sleeping");
    sleep(1);
}

Because there is no-one else accessing aObject during this loop, according to the C++ Standard.

That is what undefined behavior means: blown up expectations.
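As a sketch of what blocks this transformation, assuming C++11 std::atomic is available (the names flag and spin are illustrative, not from the question):

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<int> flag{0};

// Each iteration must perform a real load of flag: the compiler may no
// longer assume the value is stable across iterations, so it cannot
// legally rewrite this loop into while (true).
void spin() {
    while (flag.load() == 0) {
        printf("Sleeping");
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```

The atomic load is part of the program's observable synchronization, so the hoisting optimization described above is no longer permitted.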

Ukase answered 22/2, 2013 at 9:14 Comment(26)
And then it really boils down to "are you willing to gamble your program's correctness on the assumption that the compiler is not, and will never be, clever enough to make this optimization". :)Mortification
But the question is "Is there any practical chance that the values ever obtained by T1 will be different than 0 or 5 with modern computer architectures and compilers?", and the answer is "no", so long as the load and store are atomic.Purify
@EdS.: and as long as neither of the two are optimized out (because otherwise they won't reach the CPU at all, and then it doesn't matter if they'd be executed atomically or not), but as this answer points out, the optimizer is allowed to optimize it outMortification
@jalf: No, optimized or not, the question is "Is there any practical chance that the values ever obtained by T1 will be different than 0 or 5 ". He didn't ask "will the other thread ever read a 5, or will it always read 0". He is asking whether or not he will ever read some intermediate value.Purify
@EdS.: you might not like it, but this answer brings up a very important point, in that it explains how this code could fail to work practically speaking. It gives a plausible reason why a real compiler might break this code, without resorting to "nasal demons" or "ordering pizza" or "the compiler hates you and just wants to spite you". This answer points out why the optimizer might break the code in its quest to make your code fast. I think that's pretty significant. Because it means that even if every compiler today accepts the code, the next version might failMortification
@jalf: I'm not saying it isn't interesting or informative, I'm saying it does not address the question that the OP actually askedPurify
@EdS.: true enough. But he clearly wrote that question under the assumption that "both the values 0 and 5 will be seen", and they might not beMortification
@jalf: ...which is why he asked :DPurify
@EdS.: no, as you are so keen to point out, he asked something else. He assumes that "oh, we'll certainly get the values 0 and 5 eventually", so he instead asked "but will we also get other intermediate values". And no, we (practically speaking) won't, but his premise is flawed, because we might never see the value 5 either. So you're right, it doesn't technically speaking answer the question, but it answers what he needs to know. Because the assumption that led to the question is wrong.Mortification
@Matthieu: I guess adding volatile qualifier would oblige the compiler not to make such an optimization?Inheritance
@jalf: No, it gives great technical insight, and not a single yes or no to the answer, which you assume comes from a flawed premise. It's great to point out that the compiler can optimize away the assignment, but that information should come in addition to answering the question.Purify
@Andrew: I am not too savvy about volatile but I think that yes, it would prevent this optimization. However volatile was originally meant for hardware access, and therefore there may be issues in the context of multithreading related to the order in which writes become apparent to another thread. TL;DR: just use std::atomic<int> and let the compiler optimize.Ukase
@Inheritance it would (writes to a volatile variable count as observable behavior, so they may not be optimized away and may not be reordered). But it is overkill. A memory fence or an std::atomic would be more appropriate (and no more expensive)Mortification
@EdS.: Right, a more direct answer would definitely help. I added a couple sentences at the top to make it clear what is answered by this question.Ukase
@MatthieuM. it's perfectly OK to use volatile in the context of multithreading, even MSDN mentions that: >The volatile keyword is a type qualifier used to declare that an object can be modified in the program by something such as the operating system, the hardware, or a concurrently executing thread.Interact
P.S. I'm not saying volatile is the best solution here, but it's one of options.Interact
@Interact volatile as a sometimes-applicable synchronization primitive in concurrent contexts is an MS extension. Using volatile as a synchronization primitive is not portable or recommended for modern programs.Prunelle
printf cannot change aObject - only true if aObject is a non-escaped local variable. If it's globally reachable (which it has to be for threading to even be a consideration), real implementations don't try to special-case printf as being anything like "pure" (or in GCC inline asm terms, lacking a "memory" clobber like a normal non-inline function call). With functions like setvbuf, you can give printf pointers to your program's variables so such an optimization wouldn't even be safe. So printf wasn't a good example. :/Caveat
in your example, the program doesn't know that getFoo will return zero, so it has to take at least one more step; the loop is either never run or an infinite loop.Heid
@xaxxon: No. And please do NOT edit an answer to change its content on your whim. I purposefully left it out after seeing your comment because I think it's extra clutter that is unnecessary to convey the crux of this answer. Nobody seemed to think it odd for 11 years, either. This is NOT an optimization course, and too many pedantic details will detract from the main point (optimizing out the check).Ukase
Your actual answer is wrong, that's why I edited it. " therefore I can safely transform the code" No, you can't. It is an incorrect transformation, which is why I correctly changed it. Your transformation is WRONG. It is NOT a valid optimization of the code listing above it that it purports to be.Heid
@MatthieuM. And it was not "on a whim", so I don't appreciate you being condescending as such.Heid
@xaxxon: I am not being condescending, I am being annoyed at your edit. One of the main guidelines for editing is to respect the author's intent, and your edit didn't. I purposefully favored a short snippet to avoid distracting from the crux of the answer. You couldn't know that -- unless you're a mind reader -- which is precisely the point of NOT making such edits in the first place.Ukase
Well, if anyone wants the correct version of this answer, it's available here: stackoverflow.com/revisions/15020548/3 The answer as posted is wrong.Heid
@xaxxon: Actually, following your earlier tantrum, I've added an item to the list of "checks" before the transformation. So while I can't promise the answer is right, it's at least not suffering from the problem you described earlier.Ukase
You've made this answer worse. Also, you're being condescending again. These are not "whims" or "tantrums". And this change makes your code worse, making it look like you don't have to write the code for it to know that it's zero. That is new code that you have to add during the transformation, not code that already exists. It's a check your code removed.Heid
Not sure what you're looking for. On most modern architectures, there is a very distinct possibility that getFoo() always returns 0, even after changeFoo has been called. With just about any decent compiler, it's almost guaranteed that getFoo() will always return the same value, regardless of any calls to changeFoo, if it is called in a tight loop.

Of course, in any real program, there will be other reads and writes, which will be totally unsynchronized with regards to the changes in foo.

And finally, there are 16-bit processors, and there may also be a possibility with some compilers that the uint32_t isn't aligned, so that the accesses won't be atomic. (Of course, you're only changing bits in one of the bytes, so this might not be an issue.)
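One way such a misaligned uint32_t can arise in practice is a packed struct; a hypothetical illustration using the widely supported #pragma pack extension (not from the question):

```cpp
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)
struct Packed {
    char tag;
    uint32_t foo;   // lands at byte offset 1, so it is misaligned
};
#pragma pack(pop)
// A 32-bit access at such an odd address may be split into several bus
// cycles on some hardware (or trap on others), so the access is no
// longer guaranteed to be atomic.
```

offsetof(Packed, foo) is 1 here instead of the usual 4, which is exactly the situation where the single-instruction-read argument breaks down.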

Someway answered 22/2, 2013 at 9:04 Comment(1)
+1 here, it does answer the question (I didn't see that you posted an answer during all of the commenting).Purify
Is there any practical chance that the values ever obtained by T1 will be different than 0 or 5 with modern computer architectures and compilers? What about other primitive types?

Sure - there is no guarantee that the entire data will be written and read in an atomic manner. In practice, you may end up with a read which occurred during a partial write. What may be interrupted, and when, depends on several variables. So in practice, the results could easily vary as size and alignment of types vary. Naturally, that variance may also be introduced as your program moves from platform to platform and as ABIs change. Furthermore, observable results may vary as optimizations are added and other types/abstractions are introduced. A compiler is free to optimize away much of your program; perhaps completely, depending on the scope of the instance (yet another variable which is not considered in the OP).

Beyond optimizers, compilers, and hardware-specific pipelines: the kernel can even affect the manner in which this memory region is handled. Does your program guarantee where the memory of each object resides? Probably not. Your object's memory may exist on separate virtual memory pages -- what steps does your program take to ensure the memory is read and written in a consistent manner for all platforms/kernels? (None, apparently.)

In short: If you cannot play by the rules defined by the abstract machine, you should not use the interface of said abstract machine (e.g. you should just understand and use assembly if the specification of C++'s abstract machine is truly inadequate for your needs -- highly improbable).

All the assembler code I investigated so far was using 32-bit memory reads and writes, which seems to preserve the integrity of the operation.

That's a very shallow definition of "integrity". All you have is (pseudo-)sequential consistency. As well, the compiler needs only to behave as if in such a scenario -- which is far from strict consistency. The shallow expectation means that even if the compiler actually made no breaking optimization and performed reads and writes in accordance with some ideal or intention, the result would be practically useless -- your program would observe changes typically 'long' after their occurrence.

The subject remains irrelevant, given what specifically you can guarantee.

Prunelle answered 22/2, 2013 at 9:38 Comment(7)
The last sentence is the very important one.Pastose
@justin: I can't agree with the last paragraph: Imagine a case (and this was actually the root of my question) where I have several threads that perform tasks in a loop and they count (=>write) the number of executions just for statistics, and I want to display (=>read) this number to see the statistics. It doesn't statistically matter to me whether the number is 1000000 or 1000042. If an unsynchronized write-read can possibly make the program crash, it IS an issue for me. But if it just makes me see a fake number, I can live with it. So the result COULD be practically useful.Inheritance
@Inheritance How does that make sense? Why would you want to create statistics which could be totally wrong. What would the purpose be of creating false data? They could potentially differ by numbers much larger than 42.Pastose
@Inheritance most obvious case in that scenario: since you are not guaranteed strict consistency, read and write operations to the shared counter are ultimately divisible. therefore, the accumulator value will be exact only in very specific scenarios (scenarios you cannot guarantee using the abstract machine in combination with sequential consistency alone). illustration: say you have 15 threads and they all perform 1 million operations/increments, sharing one counter - consider it a miracle (read: purely chance) if your program's operation counter consistently reaches 15 million.Prunelle
@Inheritance that also means that the accuracy of the counter can vary dramatically as the compilers, hardware, optimizations, machine's workload during execution, etc. changes. so it could be 99% accurate in one run, and 96% accurate the next time you run it.Prunelle
@bamboon, sure they can but it's me (human who reads the statistic) to decide what to do with this knowledge. None of the part of the program's execution depends on this value.Inheritance
Another usefulness example: I want to refresh a cached value no more often than 1 mln loops (loops are done by multiple threads). Even if the actual refresh happens after 50 mln it is still acceptable (though I can start thinking of changing the OS because its scheduler apparently sucks).Inheritance
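For the statistics-counter scenario in these comments, a sketch of a C++11 counter using relaxed atomics, which makes the count exact and race-free without the cost of full synchronization (the names ops, worker, and run_workers are illustrative):

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

std::atomic<uint64_t> ops{0};   // shared statistics counter

void worker(int n) {
    for (int i = 0; i < n; ++i)
        ops.fetch_add(1, std::memory_order_relaxed);  // atomic increment, no ordering constraints
}

void run_workers(int threads, int per_thread) {
    std::vector<std::thread> ts;
    for (int i = 0; i < threads; ++i)
        ts.emplace_back(worker, per_thread);
    for (auto& t : ts)
        t.join();
}
```

Unlike an unsynchronized uint64_t (where increments can be lost and the total drifts unpredictably), the relaxed atomic total is exact on every run, while still compiling to an ordinary atomic increment on mainstream hardware.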
In practice (for those who did not read the question), any potential problem boils down to whether or not a store operation for an unsigned int is an atomic operation; on most (if not all) machines you will likely write code for, it will be.

Note that this is not stated by the standard; it is specific to the architecture you are targeting. I cannot envision a scenario in which a calling thread will read anything other than 0 or 5.

As to the title... I am unaware of varying degrees of "undefined behavior". UB is UB, it is a binary state.

Purify answered 22/2, 2013 at 8:56 Comment(26)
That is wrong. The compiler could change your program so that the changes will never be visible to other threads.Pastose
@bamboon: Care to explain how?Purify
Two threads accessing a variable where one is a write and another is a read results in a race-condition which is UB per standard.Pastose
@bamboon: Umm, no, it's not. That is likely an atomic store on a 32-bit integer, i.e., no race condition. That's what I said, and it is correctPurify
It is, see standard 1.11.21. (using N3376 here)Pastose
First, the access to foo is not atomic in the sense the standard uses the term. Second, while I can't imagine a scenario where the reading thread would see anything but 0 or 5, I can imagine quite a few (including some very, very likely) where the reading thread will never see anything but 0, regardless of how many times changeFoo is called.Someway
@JamesKanze: Yet in practice, your x86 chip will always load and store that value atomically. You skipped over an important part of the question. So, while the technical, standard-only answer is interesting, it is not what the OP asked for. He asked about what could happen in practice.Purify
@EdS.: You are forgetting something very important, the As If behavior. The compiler is entitled to produce whatever code it wants as long as the behavior is as if it had produced a literal translation; the rules that determine whether it's as if or not are given by the Standard, and whenever you read undefined behavior it means the compiler can suppose it will never happen. Therefore, it can lift getFoo out of the loop... if it can prove no call to changeFoo ever happens within this thread. And then you are damned.Ukase
@MatthieuM.: That's a good point, but again, the OP asked if another thread can ever see anything but a 0 or a 5. I can't imagine a scenario in which one could. I'm not saying I know everything and there's nothing I haven't considered, but that's my answer.Purify
@EdS. No. Not even in practice. At least not for the definition of atomic used in the standard (which is the only useful definition in a multithreaded world).Someway
@JamesKanze: And again you are ignoring the question. This was not a theoretical inquiry, it was specifically asking about practical applications. In that context my answer is correct as far as I am aware.Purify
@EdS.: he was asking about "practical applications" where another outcome is "theoretically possible". :)Mortification
@EdS. You're ignoring reality. A store operation of an int isn't "atomic" on a Sparc, nor as far as I know on an Intel, or any other modern processor with a memory pipeline or a cache.Someway
@jalf: Sure, but in practice, it isn't (as long as the load and store are atomic). What I don't get is the answers stating only "UB, anything can happen herp derp". They're not helpful in this context.Purify
@JamesKanze: it certainly is on Intel CPUs (assuming the int is properly aligned). Regardless of pipelining and cache, every core will always either see the write has having been performed, or not having been performed. No inbetween states. That's guaranteed by the ISA :)Mortification
@JamesKanze: You apparently didn't read my response either. Kudos. And I quote: Any potential problem boils down to whether or not a store operation for an unsigned int is an atomic operation which, on most (if not all) machines you will likely write code for, it will be". So yeah, I already covered that.Purify
@EdS. And you're ignoring what I'm saying: that a store operation for an unsigned int is not an atomic operation (in the most widely used sense of atomic today) on a modern Intel or Sparc. (I'm less familiar with other processors.) For it to be atomic, you need additional instructions (or a LOCK prefix on an Intel).Someway
@jalf Not according to the Intel documentation. The Intel processor has fence instructions for a good reason.Someway
@JamesKanze: ...Not sure where you're getting your information, or perhaps you have a wildly exotic definition for "atomic". Download this, chapter 8. According to it, atomic store operations include: storing a byte, storing word-aligned word and dword-aligned dword on all x86 CPUs. I honestly don't know what you are talking about.Purify
@JamesKanze: As I said before, there's nothing wrong with going into more detail, but what is the answer to "Is there any practical chance that the values ever obtained by T1 will be different than 0 or 5". The answer is "no", which is all that I have ever said through these comments. Ok, I have got to go to bed now, it's late. I will say this; The OP has a wealth of information to go over.Purify
@JamesKanze yes it does, but that has nothing to do with atomicity. word-sized reads and writes are guaranteed to be atomic. But they are not all guaranteed to be performed in order, unless you insert these fence instructions. Those are completely separate issues. Fences are for forcing ordering between instructions, it does nothing about atomicity. Atomicity is guaranteed for any well-aligned word-sized read/write.Mortification
@EdS. They seem to be contradicting themselves (or using a very weak definition of "atomic"); in chapter 8.2, they say more or less exactly what I've been saying, that accesses are not guaranteed to be atomic, in the sense the word is used in the C++ standard, and that it is usually used when talking about multithreaded issues. (There is a clear semantic shift here, because atomic didn't used to have this stronger meaning.)Someway
@jalf In the C++ standard, and in most recent books about threading, "atomic" is used to imply ordering (and guaranteed execution). This is a clear semantic shift; it would probably have been better if some other term had been used, but ordering is part of the modern meaning of "atomic". In particular, it's quite clear that unless there is a fence instruction, other threads are not guaranteed to see your write.Someway
@EdS. Anyway, I think our disagreement is more about the meaning of the word "atomic", than what actually happens at a hardware level. (Since you've posted a link to the same document I base my claims on.) And in practice: the reading process will never see anything but 0 or 5, but it may never see the 5. And also, on a PC, there's enough other stuff going on that the OS will eventually do a context switch, in which case, it will emit a fence instruction, so the other thread will eventually see a 5 (but maybe only seconds after it was written).Someway
@JamesKanze: what. I'm sorry, but no. The std::atomic template has broader semantics than just guaranteeing atomicity, but that does not change the semantic of the word "atomicity" in general. Where does the C++ standard define the concept of something being "atomic" (without talking about std::atomic specifically) the way you describe?Mortification
@jalf The name C++ uses was chosen because the semantics correspond to the most widely used meaning of atomic today (at least with regard to memory access).Someway
R
1

A function such as:

unsigned short test(unsigned short *p)
{
  unsigned short temp = *p;
  return temp - (temp >> 15);
}

would always return a value in the range 0-65534 when passed a pointer to any valid storage, regardless of how outside code might be accessing that storage at the time, provided the implementation specifies (as a form of "conforming language extension") that every read of an 8-, 16-, or 32-bit value from a valid address always yields a value of its type (though not necessarily a meaningful one), even in the presence of race conditions. Compilers that don't offer such a guarantee, however, may generate code that can sometimes yield an "impossible" value like 65535 when a race occurs. Indeed, gcc for ARM does precisely that when given the above function and targeting the Cortex-M0: its optimizer generates code equivalent to:

unsigned short test(unsigned short *p)
{
  unsigned temp1 = *p;                 // zero-extending 16-bit load
  unsigned temp2 = *(signed short*)p;  // second, sign-extending load of *p
  return (temp1 + (temp2 >> 15)) & 0xFFFFu;
}

The Cortex-M0 is a modern platform, and the above is a concrete example of a compiler for such a platform generating "optimized" code which would malfunction in a manner that straightforwardly-processed code for that same platform never would.

Rauch answered 16/1 at 23:23 Comment(0)
C
0

There are lots of ways for a DeathStation 9000 C++ implementation to break your program, e.g. by compiling foo = 5 into a store of 4 and then a memory increment, so a value is visible that never existed in the abstract machine. But that doesn't seem plausible on any real compiler anyone would want to use, except maybe if compiling shared = tmp ? 1234 : 5678; it could unconditionally store a 1234 then conditionally store a 5678 instead of doing an ALU select. Probably still unlikely.

One major real-world effect that can't go unmentioned is hoisting a non-atomic load (or sinking a store) out of a loop. MCU programming - C++ O2 optimization breaks while loop explains the details. But that would just stop you from ever seeing the store, not give you a value other than 0 or 5.

Other practical effects include int tmp = shared; compiling later uses of tmp into re-reads of shared, so effectively your local variable can have multiple inconsistent values. See the "Invented loads" section in Who's afraid of a big bad optimizing compiler? on LWN. (The context of that article is Linux kernel programming, where they use GCC/Clang's semantics for volatile (via WRITE_ONCE or READ_ONCE macros) as basically equivalent to std::atomic<> with memory_order_relaxed, for types of register-width or narrower.)

Definitely read that whole article; it's written from exactly the perspective you're looking for, describing real-world badness that can plausibly happen on real CPUs with real compilers. It also has a section about invented stores with a more plausible example than my Deathstation 9000 first paragraph, involving multiple small members of a struct.


But you asked about practical cases that could cause tearing or bad effects other than a simple lack of visibility of a single change, with CPUs and compilers that are used in practice. Tearing between 0 and 5 is implausible1, but let's talk about 0 and 0x12345678.

Which types on a 64-bit computer are naturally atomic in gnu C and gnu C++? -- meaning they have atomic reads, and atomic writes - none are guaranteed; Nate's answer shows possible tearing for assigning certain constants to uint64_t on AArch64 before ARMv8.4, and that you can coax a compiler into doing an unaligned 16-bit load from the middle of a 32-bit unsigned integer just by reading the whole variable and doing some innocent-looking computations on the temporary result:

unsigned x;

unsigned foo(void) {
    return (x >> 8) & 0xffff;
}

For x86-64, compilers use movzx eax, WORD PTR x[rip+1], which is safe: both AMD and Intel separately guarantee that 1, 2, or 4-byte unaligned loads / stores on cacheable memory contained within an aligned 8-byte chunk are guaranteed atomic. (See Why is integer assignment on a naturally aligned variable atomic on x86?). And both mainstream x86-64 ABIs have alignof(unsigned) == 4.

But that's not guaranteed on all other ISAs, such as ARMv7-M or ARMv8-A. Unaligned loads are supported, and clang will use them (Godbolt), but the architecture only guarantees atomicity for aligned loads, e.g. "All halfword accesses to halfword-aligned locations.". Instead of putting significant shift hardware into their load execution units like x86 CPUs are required to do, ARM chips are allowed, if they want, to do separate byte loads and combine the bytes. From the ARMv8-A architecture reference manual: All other memory accesses are regarded as streams of accesses to bytes, and no atomicity between accesses to different bytes is ensured by the architecture.

According to a blog post, fallback to separate byte loads is in practice what happened on real ARMv6 hardware. I'm not sure if later low-power ARMs still do that. Maybe not, since compiler tuning heuristics are willing to use unaligned loads for them. (Even -march=armv6t2 -mtune=cortex-a53 gets clang to avoid unaligned loads, Godbolt, but that might be due to ARMv6 allowing a config choice so unaligned loads use the offset-within-word bits as a byte rotate count like ARMv5(!) or undefined for halfword, not as a normal byte-offset like ARMv7.)

But we can get a compiler to emit code that's not guaranteed on paper to be atomic on the multi-core -mcpu we asked it to compile for, such as Cortex-M55 which is ARMv8.1-M and has dual-core versions available, or Cortex-A53 which is ARMv8-A and is often found in multi-core CPUs. (Godbolt)

# ARM clang 11  -O2  -Wall -mcpu=cortex-a53 or -mcpu=cortex-m55
foo():
        movw    r0, :lower16:x
        movt    r0, :upper16:x
        ldrh.w  r0, [r0, #1]     @@@ unaligned load from x+1
        bx      lr

Similar with AArch64 Clang. I don't have the hardware to test if tearing is actually visible in practice. I suspect it might not be on Cortex-A53, otherwise it might have been better for Clang to compile like GCC to a word load and ubfx unsigned bitfield-extract. If tearing is possible, that means it took extra cycles to do multiple accesses to L1d cache even within an aligned word.

The architecture manual says tearing is possible for any misaligned halfword load/store, but on some cores it probably only happens when crossing a cache-line boundary for cacheable load/store, or maybe a 4 or 8-byte boundary depending on the CPU internals.



Footnote 1: 0 and 5 differ only in the low byte.

Tearing between 0 and 5 in a uint32_t is pretty much impossible even on an 8-bit machine: the only HW guarantee required is that byte stores are atomic.

  • 0u and 5u differ only in the low 3 bits; their top 3 octets are the same.
    (This would also be the case for signed 0 or 5 in any of the three signed-integer representations allowed by the standard, two's complement, one's complement, and sign/magnitude.)

  • uint32_t is required to be exactly 32 bits with no padding bits, so padding can't split the low 3 value bits across a byte boundary. Any endianness is possible, but within each unsigned char chunk (of at least 8 bits, the minimum CHAR_BIT) the bits have to be in base-2 place-value order for any unsigned type. (It is well-defined to use memcpy or unsigned char* to examine the object-representation of other types.)

  • So tearing at byte boundaries couldn't create values other than 0u or 5u. Obviously with values like 0 and 0x12345678u such tearing would be visible.

All CPU architectures I'm aware of have atomic byte load / byte store, if they have byte accesses at all (not DEC Alpha). See Can modern x86 hardware not store a single byte to memory? (which covers more than just x86). Commit to L1d cache might involve an extra cycle on non-x86 CPUs to update a larger ECC granule, but it's like an atomic RMW.

Caveat answered 13/1 at 9:35 Comment(1)
As noted in my answer, the gcc compiler for the ARM Cortex-M0 will generate code for 16-bit loads whose race-condition behavior would sometimes be inconsistent not only with any bit pattern the storage has ever held, but also with any bit pattern the storage would even be capable of holding, even if at the hardware level all loads and stores would be atomic.Rauch
R
-1

Undefined behavior is guaranteed to be as undefined as the word undefined.
Technically, reasoning about the observable behavior is pointless because it is simply undefined behavior: the compiler is not required to exhibit any particular behavior. It may work as you think it should, or it may not, or it may burn your computer; anything and everything is possible.

Rivard answered 22/2, 2013 at 8:54 Comment(5)
I don't remember any specs saying that 'race condition' implies 'undefined behavior' in a way the term is used by the specs. Or does it?Inheritance
@Inheritance Unfortunately, yes. [intro.multithread]§21: The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.Microcyte
I have trouble finding this part in pre-C++11 specifications. It means that your answer is correct only for those compilers that comply to C++11 (and provided that I activate C++11 features like in gcc). For those that don't, this is not undefined behavior. But maybe I took a wrong document, did I?Inheritance
@Inheritance Before C++11 the standard wasn't aware of multithreading at all, which means that basically any multithreading was UB, or let's rather say implementation-defined by the specific implementation you were using (e.g. pthreads).Pastose
@Andrew: Pre-C++11 standards do not have any mention of multithreading. Pre-C++11 standards provide some guarantees on simultaneous access to objects, but not in an elaborated way. The answer, though, is true for both pre-C++11 and C++11-compliant compilers.Rivard
