A lot of questions on SO, and articles/books such as Paul McKenney's perfbook (https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2018.12.08a.pdf) and Preshing's series of articles (for example https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/), talk about memory ordering abstractly, in terms of the ordering and visibility guarantees provided by different barrier types. My question is: how are these barriers and memory ordering semantics implemented microarchitecturally on x86 and ARM?
For store-store barriers, it seems that on x86 the store buffer maintains the program order of stores and commits them to L1D in that order (hence making them globally visible in the same order). If the store buffer is not ordered, i.e. does not maintain stores in program order, how is a store-store barrier implemented? Is it just "marking" the store buffer in such a way that stores before the barrier commit to the cache-coherent domain before stores after it? Or does the memory barrier actually flush the store buffer and stall all later instructions until the flush is complete? Could it be implemented both ways?
For load-load barriers, how is load-load reordering prevented? It is hard to believe that x86 executes all loads in order! I assume loads can execute out of order but commit/retire in order. If so, when a CPU executes 2 loads to 2 different locations, how does it ensure that if the first load got its value as of, say, time T100, the second load got its value at or after T100? What if the first load misses in the cache and is waiting for data while the second load hits and gets its value? When load 1 finally gets its value, how does the CPU ensure that this value does not come from a store newer than the one load 2's value came from? If loads can execute out of order, how are violations of memory ordering detected?
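To make the store-store and load-load cases concrete, here is a minimal message-passing sketch in C++ of the pattern I have in mind. The names `payload`, `ready`, `producer`, `consumer` and `run_message_passing` are made up for illustration. As far as I understand, on x86 both fences typically compile to no instruction at all (they only restrain the compiler), while on AArch64 they typically become `dmb ish` and `dmb ishld`; what I am asking is what the hardware does in each case to honor them.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                  // ordinary (non-atomic) data
std::atomic<bool> ready{false};   // flag that publishes the data

void producer() {
    payload = 42;                                         // store 1
    // Release fence: store 1 may not become visible after store 2
    // (this covers the store-store ordering asked about above).
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);         // store 2
}

int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {      // load 1
        // spin until the flag is observed
    }
    // Acquire fence: load 2 may not be satisfied before load 1
    // (this covers the load-load ordering asked about above).
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;                                       // load 2
}

// Runs one producer and one consumer; returns the value the consumer saw.
int run_message_passing() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    assert(seen == 42);  // guaranteed by the fence-to-fence synchronization
    return seen;
}
```

If either fence is removed, the C++ memory model no longer guarantees that the consumer sees 42 (and the access to `payload` becomes a data race), even though on x86 the generated instructions may well be identical.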
Similarly, how are load-store barriers (implicit in all loads on x86) implemented, and how are store-load barriers (such as mfence) implemented? In other words, what do the dmb ld/st and plain dmb instructions do microarchitecturally on ARM, and what do every load, every store, and the mfence instruction do microarchitecturally on x86 to ensure memory ordering?
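For the store-load case, the classic store-buffering litmus test is the scenario I am thinking of. Below is a minimal sketch in C++; `x`, `y`, `Result` and `run_store_buffering` are names I made up for illustration. As far as I understand, the sequentially consistent fence typically compiles to `mfence` or a dummy `lock`-prefixed instruction on x86 and to `dmb ish` on AArch64, and it is the only one of the four barrier kinds that costs an actual instruction on x86.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
std::atomic<int> x{0};
std::atomic<int> y{0};

struct Result {
    int r1;  // what thread 1 read from y
    int r2;  // what thread 2 read from x
};

// Each thread stores to one variable, then loads the other.
// With the seq_cst fences in place, the outcome r1 == 0 && r2 == 0
// is forbidden: at least one thread must observe the other's store.
Result run_store_buffering() {
    x.store(0);
    y.store(0);
    int r1 = -1;
    int r2 = -1;

    std::thread t1([&] {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // store-load barrier
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // store-load barrier
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();

    assert(!(r1 == 0 && r2 == 0));  // ruled out by the two fences
    return Result{r1, r2};
}
```

Without the fences, both threads reading 0 is allowed even on x86, because each store can still be sitting in its core's store buffer while the following load executes. That is why I am asking specifically what mfence does to the store buffer and to later loads.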