I'd guess that the hang-up is the notion of a "store-buffer". The starting point is the great disparity between the speed of a processor core and the speed of memory. A modern core can easily execute a dozen instructions in a nanosecond, but a RAM chip can require 150 nanoseconds to deliver a value stored in memory. That is an enormous mismatch; modern processors are filled to the brim with tricks to work around it.
Reads are the harder problem to solve; a processor will stall and not execute any code while it waits for the memory sub-system to deliver a value. An important sub-unit in a processor is the prefetcher. It tries to predict which memory locations the program is going to load, so it can tell the memory sub-system to read them ahead of time. The physical reads therefore occur much sooner than the logical loads in your program.
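The paragraph above is about the hardware prefetcher, which needs no help from you, but the same idea can be hinted at from software. A minimal sketch, assuming GCC or Clang (the `__builtin_prefetch` intrinsic is a compiler extension, not standard C++), purely to illustrate "physical reads happening before the logical loads":

```cpp
#include <cstddef>

// Sum an array while asking the memory sub-system to start fetching data a
// few cache lines ahead of where the loop is currently reading. The hardware
// prefetcher usually spots a linear scan like this on its own, so treat the
// hint as an illustration, not an optimization recipe.
long sum(const int* data, std::size_t count) {
    long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (i + 16 < count) {
            __builtin_prefetch(&data[i + 16]);  // read-ahead hint, 16 elements out
        }
        total += data[i];  // the logical load; ideally the line is already in cache
    }
    return total;
}
```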
Writes are easier; a processor has a buffer for them. Think of it as a queue, the way you would model one in software. The execution engine can quickly dump a store into the queue and doesn't get bogged down waiting for the physical write to complete. This is the store-buffer. The physical writes to memory therefore occur much later than the logical stores in your program.
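To make the queue idea concrete, here is a toy software model. It is a sketch of the concept only, nothing like real hardware, and the names (`StoreBufferModel`, `drain`, and so on) are made up for illustration:

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <utility>

// Toy model of a store-buffer: stores are queued instead of being written to
// memory right away; drain() later retires the oldest entry to memory.
struct StoreBufferModel {
    std::deque<std::pair<std::uintptr_t, int>> pending;  // (address, value), oldest first
    std::unordered_map<std::uintptr_t, int>& memory;     // stand-in for RAM

    explicit StoreBufferModel(std::unordered_map<std::uintptr_t, int>& mem)
        : memory(mem) {}

    // The execution engine dumps the store into the queue and moves on.
    void store(std::uintptr_t addr, int value) {
        pending.emplace_back(addr, value);
    }

    // Much later, the physical write actually reaches memory.
    void drain() {
        if (!pending.empty()) {
            memory[pending.front().first] = pending.front().second;
            pending.pop_front();
        }
    }
};
```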
The trouble starts when your program uses more than one thread and they access the same memory locations. Those threads will run on different cores. This causes many problems; ordering becomes very important. Clearly the early reads performed by the prefetcher can deliver stale values, and the late writes performed by the store buffer make it worse yet. Solving it requires synchronization between the threads, which is very expensive; a processor is easily stalled for dozens of nanoseconds waiting for the memory sub-system to catch up. Instead of threads making your program faster, they can actually make it slower.
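The standard demonstration of this is the store-buffer litmus test. A minimal sketch with std::atomic: each thread writes its own flag and then reads the other one. With relaxed ordering (and, on real x86 hardware, even with plain stores and loads) both threads are allowed to read the stale 0, precisely because each store is still sitting in its core's store buffer when the other core reads. Switching both operations to `std::memory_order_seq_cst` forbids that outcome, at the cost of the expensive synchronization described above.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    for (int i = 0; i < 100000; ++i) {
        x = 0; y = 0;
        std::thread a([] { x.store(1, std::memory_order_relaxed);
                           r1 = y.load(std::memory_order_relaxed); });
        std::thread b([] { y.store(1, std::memory_order_relaxed);
                           r2 = x.load(std::memory_order_relaxed); });
        a.join(); b.join();
        if (r1 == 0 && r2 == 0)  // both stores were still buffered when the loads ran
            std::printf("both threads read stale values on iteration %d\n", i);
    }
}
```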
The processor can help; store-buffer forwarding is one such trick. A logical read can come right behind a logical store to the same address while that store is still sitting in the core's own buffer and has not been executed as a physical write yet. Reading memory at that point would produce a stale value. What store-buffer forwarding does is look through the pending stores in the buffer and find the latest write that matches the read address. That "forwards" the store in time, making it look like it was executed earlier than it actually will be. The thread gets the correct value, the one that eventually ends up in memory, without having to wait for it. The read no longer passes the write. Note that the buffer belongs to the core; a thread on another core cannot see those pending stores and will keep reading the stale value until the buffer drains.
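Continuing the toy `StoreBufferModel` sketched above, the forwarding lookup could be modelled like this; again a sketch of the idea, not of any particular processor's implementation:

```cpp
// A load first scans the core's own pending stores, newest first, and forwards
// the matching value if it finds one; only on a miss does it read memory.
int load(StoreBufferModel& sb, std::uintptr_t addr) {
    for (auto it = sb.pending.rbegin(); it != sb.pending.rend(); ++it) {
        if (it->first == addr) {
            return it->second;          // forwarded straight from the store buffer
        }
    }
    auto found = sb.memory.find(addr);  // no pending store: fall back to memory
    return found != sb.memory.end() ? found->second : 0;
}
```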
Actually writing a program that takes advantage of this kind of store-buffer behavior is rather inadvisable. Apart from the very iffy timing, such a program will port very, very poorly. Intel processors have a strong memory model that provides solid ordering guarantees. But you can't ignore the kind of processors that are popular on mobile devices these days, which consume a lot less power in part by not providing such guarantees.
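The portable approach is to state the ordering you need and let the compiler emit whatever barriers the target actually requires. A minimal sketch with a release store / acquire load pair (the variable names are illustrative); it publishes a value safely on x86 and on weakly ordered mobile processors alike:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // plain write
    ready.store(true, std::memory_order_release);  // publish: the write above may not sink below
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) // wait for the publication
        ;                                          // spin, fine for a tiny demo
    assert(payload == 42);                         // guaranteed to see the payload
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
}
```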
And the feature can in fact be very detrimental: it hides synchronization bugs in your code, and they are the worst possible bugs to diagnose. Micro-processors have been staggeringly successful over the past 30 years. They did not, however, get easier to program.