Short Answer : Cache coherency works most of the time but not always. You can still read stale data. If you don't want to take chances, then just use a memory barrier
Long Answer : CPU core is no longer directly connected to the main memory. All loads and stores have to go through the cache. The fact that each CPU has its own private cache causes new problems. If more than one CPU is accessing the same memory it must still be assured that both processors see the same memory content at all times. If a cache line is dirty on one processor (i.e., it has not been written back yet to main memory) and a second processor tries to read the same memory location, the read operation cannot just go out to the main memory. . Instead the content of the first processor’s cacheline is needed. The question now is when does this cache line transfer have to happen? This question is pretty easy to answer: when one processor needs a cache line which is dirty in another processor’s cache for reading or writing. But how can a processor determine whether a cache line is dirty in another processor’s cache? Assuming it just because a cache line is loaded by another processor would be suboptimal (at best). Usually the majority of memory accesses are read accesses and the resulting cache lines are not dirty. Here comes cache coherency protocols. CPU's maintain data consistency across their caches via MESI or some other cache coherence protocol.
With cache coherency in place, should we not see that latest value always for the cacheline even if it was modified by another CPU? After all that is whole purpose of the cache coherency protocols. Usually when a cacheline is modified, the corresponding CPU sends an "invalidate cacheline" request to all other CPU's. It turns out that CPU’s can send acknowledgement to the invalidate requests immediately but defer the actual invalidation of the cacheline to a later point in time. This is done via invalidation queues. Now if we get un-lucky enough to read the cacheline within this short window (between the CPU acknowledging an invalidation request and actually invalidating the cacheline) then we can read a stale value. Now why would a CPU do such a horrible thing. The simple answer is PERFORMANCE. So lets look into different scenarios where invalidation queues can improve performance
Scenario 1 : CPU1 receives an invalidation request from CPU2. CPU1 also has a lot of stores and loads queued up for the cache. This means that the invalidation of the requested cacheline takes times and CPU2 gets stalled waiting for the acknowledgment
Scenario 2 : CPU1 receives a lot of invalidation requests in a short amount of time. Now it takes time for CPU1 to invalidate all the cachelines.
Placing an entry into the invalidate queue is essentially a promise by the CPU to process that entry before transmitting any MESI protocol messages regarding that cache line. So invalidation queues are the reason why we may not see the latest value even when doing a simple read of a single variable.
Now the keen reader might be thinking, when the CPU wants to read a cacheline, it could scan the invalidation queue first before reading from the cache. This should avoid the problem. However the CPU and invalidation queue are physically placed on opposite sides of the cache and this limits the CPU from directly accessing the invalidation queue. (Invalidation queues of one CPU’s cache are populated by cache coherency messages from other CPU’s via the system bus. So it kind of makes sense for the invalidation queues to be placed between the cache and the system bus). So in order to actually see the latest value of any shared variable, we should empty the invalidation queue. Usually a read memory barrier does that.
I just talked about invalidation queues and read memory barriers. [1] is a good reference for understanding the need for read and write memory barriers and details of MESI cache coherency protocol
[1] http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf
volatile
keyword in C and C++ does nothing for thread synchronization (don't remember about C#). Memory barriers do enforce cache coherence. You might want to read up on strong / weak memory models, and memory ordering. – Pharmacopsychosis