As @Mackie says, the pipeline will fill with `cmp`s. Intel will have to flush those `cmp`s when another core writes, which is an expensive operation. If the CPU doesn't flush them, then you have a memory order violation. An example of such a violation would be the following:
(This starts with lock1 = lock3 = var = 1.)

Thread 1:

```
spin:
    cmp lock1, 0
    jne spin
    cmp lock3, 0   # lock3 should be zero, Thread 2 already ran.
    je end         # Thus I take this path
    mov var, 0     # And this is never run
end:
```

Thread 2:

```
mov lock3, 0
mov lock1, 0
mov ebx, var   # I should know that var is 1 here.
```
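For anyone who'd rather run this than read assembly, here's a sketch of the same experiment in C11 (the names and structure are mine, and I'm assuming pthreads). On x86, these acquire loads and release stores all compile to the same plain `mov`s shown above, and the flush mechanism described below is exactly what keeps the `printf` from ever showing 0:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock1 = 1, lock3 = 1, var = 1;

static void *thread1(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&lock1, memory_order_acquire) != 0)
        ;                                    /* spin: cmp lock1, 0 / jne spin */
    if (atomic_load_explicit(&lock3, memory_order_acquire) != 0)
        atomic_store_explicit(&var, 0, memory_order_relaxed);  /* mov var, 0 */
    return NULL;
}

static void *thread2(void *arg)
{
    atomic_store_explicit(&lock3, 0, memory_order_release);    /* mov lock3, 0 */
    atomic_store_explicit(&lock1, 0, memory_order_release);    /* mov lock1, 0 */
    /* mov ebx, var */
    *(int *)arg = atomic_load_explicit(&var, memory_order_acquire);
    return NULL;
}

int main(void)
{
    int ebx = -1;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, &ebx);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("ebx = %d\n", ebx);   /* always 1 */
    return 0;
}
```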
First, consider Thread 1. If the branch for `cmp lock1, 0 / jne spin` predicts that lock1 is zero (i.e., that the loop is about to exit), the CPU adds `cmp lock3, 0` to the pipeline. In the pipeline, `cmp lock3, 0` reads lock3 and finds that it is equal to 1.
Now, assume Thread 1 is taking its sweet time, and Thread 2 begins running quickly:

```
lock3 = 0
lock1 = 0
```
Now, let's go back to Thread 1. Let's say the `cmp lock1, 0` finally reads lock1, finds out that lock1 is 0, and is happy about its branch-predicting ability. This instruction commits, and nothing gets flushed: a correct branch prediction means nothing is flushed, even with out-of-order reads, since the processor deduced that there is no internal dependency. lock3 doesn't depend on lock1 in the eyes of the CPU, so this all is okay. Now the `cmp lock3, 0`, which read lock3 as 1 (true at the time it executed), commits.
`je end` is not taken, and `mov var, 0` executes. Then Thread 2's `mov ebx, var` runs, and `ebx` is equal to 0. This should have been impossible: Thread 2 wrote lock3 before lock1, so if Thread 1 saw lock1 = 0, it should also have seen lock3 = 0 and never written var. This is the memory order violation that Intel must compensate for.
Now, the solution that Intel takes to avoid that invalid behavior is to flush. When `lock3 = 0` ran on Thread 2, it forced Thread 1 to flush instructions that use lock3. Flushing in this case means that Thread 1 will not add instructions to the pipeline until all instructions that use lock3 have been committed. Before Thread 1's `cmp lock3` can commit, the `cmp lock1` must commit. When the `cmp lock1` tries to commit, it reads that lock1 is actually still equal to 1, and that the branch prediction was therefore a failure. This causes the `cmp lock3` to get thrown out. Now that Thread 1 is flushed, lock3's location in Thread 1's cache is set to 0, and then Thread 1 continues execution (still waiting on lock1). Thread 2 now gets notified that all other cores have flushed their usage of lock3 and updated their caches, so Thread 2 then continues execution (it will have executed independent statements in the meantime, but its next instruction was another write, so it probably has to hang, unless the other cores have a queue to hold the pending `lock1 = 0` write).
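If you want to watch these flushes happen, a micro-benchmark along these lines should generate them constantly (this is my own sketch, not from Mackie's example): one core spins reading a variable while another core keeps rewriting it, so each store can invalidate the speculative loads sitting in the reader's pipeline. I believe Linux `perf` exposes the flush count as a memory-ordering machine-clear counter on many Intel CPUs, so you can run something like this under `perf stat` and watch the number climb:

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x = 0, stop = 0;
static long sink;   /* keeps the reader's loads from being optimized away */

static void *reader(void *arg)
{
    (void)arg;
    long sum = 0;
    /* No PAUSE here on purpose: the pipeline fills with speculative
       loads of x, and each store from the other core flushes them. */
    while (!atomic_load_explicit(&stop, memory_order_relaxed))
        sum += atomic_load_explicit(&x, memory_order_relaxed);
    sink = sum;
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, reader, NULL);
    for (int i = 0; i < 100000000; i++)
        atomic_store_explicit(&x, i, memory_order_relaxed);
    atomic_store_explicit(&stop, 1, memory_order_relaxed);
    pthread_join(t, NULL);
    return 0;
}
```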
This entire process is expensive, hence the PAUSE. The PAUSE helps Thread 1, which can now recover from the impending branch mispredict instantly without flushing its pipeline before branching correctly. The PAUSE similarly helps Thread 2, which doesn't have to wait on Thread 1's flushing (as said before, I'm unsure of this implementation detail, but if Thread 2 tries writing locks used by too many other cores, it will eventually have to wait on flushes).
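For reference, the PAUSE-friendly version of the spin loop looks like this (a sketch: `_mm_pause()` is the real intrinsic that emits the PAUSE instruction, while `spin_until_zero` is just a name I made up):

```c
#include <immintrin.h>   /* _mm_pause(), which emits PAUSE */
#include <stdatomic.h>

static void spin_until_zero(atomic_int *lock)
{
    /* PAUSE tells the core this is a spin-wait loop, so it stops
       stuffing the pipeline with speculative cmps; the eventual exit
       mispredict then costs almost nothing, and the writing core
       doesn't have to wait on a big flush. */
    while (atomic_load_explicit(lock, memory_order_acquire) != 0)
        _mm_pause();
}
```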
An important thing to understand is that while the flush is required in my example, it is not in Mackie's. However, the CPU has no way to know (it doesn't analyze code at all, beyond checking consecutive statement dependencies and a branch prediction cache), so it will flush the instructions accessing `lockvar` in Mackie's example just as it does in mine, in order to guarantee correctness.