Consider the diagrammed data cache architecture. (ASCII art follows.)
--------------------------------------
| CPU core A | CPU core B |          |
|------------|------------| Devices  |
|  Cache A1  |  Cache B1  | with DMA |
|-------------------------|          |
|         Cache 2         |          |
|------------------------------------|
|                RAM                 |
--------------------------------------
Suppose that
- an object is shadowed on a dirty line of Cache A1,
- an older version of the same object is shadowed on a clean line of Cache 2, and
- the newest version of the same object has recently been written to RAM via DMA.
Diagram:
--------------------------------------
| CPU core A | CPU core B |          |
|------------|------------| Devices  |
|  (dirty)   |            | with DMA |
|-------------------------|          |
|     (older, clean)      |          |
|------------------------------------|
|         (newest, via DMA)          |
--------------------------------------
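For readers who prefer code to diagrams, the supposed state can be restated as a toy C++ model. This is only a sketch of the scenario, not a simulation of any real coherence protocol: the names `CacheLine`, `Scenario`, `make_scenario` and `naive_load` are hypothetical, the version numbers are arbitrary stamps chosen only to respect the ordering described above, and `naive_load` deliberately disregards snooping and invalidation.

```cpp
#include <cassert>
#include <cstdint>

// Toy restatement of the scenario; every name here is hypothetical.
// It records only which level holds which version of the object.
struct CacheLine {
    bool valid;             // does this level hold a copy at all?
    bool dirty;             // has the copy been modified at this level?
    std::uint32_t version;  // made-up version stamp of the object
};

struct Scenario {
    CacheLine a1;               // Cache A1 (private to core A)
    CacheLine b1;               // Cache B1 (private to core B)
    CacheLine l2;               // Cache 2 (shared by both cores)
    std::uint32_t ram_version;  // what RAM holds after the DMA write
};

// The state drawn in the diagram: Cache 2 is oldest, Cache A1 holds a
// newer dirty copy, and RAM holds the newest version (written via DMA).
inline Scenario make_scenario() {
    Scenario s{
        {true, true, 2},     // a1: dirty
        {false, false, 0},   // b1: no copy
        {true, false, 1},    // l2: older, clean
        3};                  // RAM: newest, via DMA
    assert(s.l2.version < s.a1.version && s.a1.version < s.ram_version);
    return s;
}

// What a load would return if the hierarchy were searched top-down
// (L1, then Cache 2, then RAM) with coherence ignored entirely.
// This is an assumption made for illustration, not a claim about hardware.
inline std::uint32_t naive_load(const CacheLine& l1, const Scenario& s) {
    if (l1.valid) return l1.version;
    if (s.l2.valid) return s.l2.version;
    return s.ram_version;
}
```

With `Scenario s = make_scenario();`, the naive lookup gives `naive_load(s.a1, s) == 2` for core A and `naive_load(s.b1, s) == 1` for core B, and neither is the newest version (3) sitting in RAM. Whether and how real hardware avoids this outcome is what the questions below ask.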
Three questions, please.
If CPU core A tries to load (read) the object, what happens?
If, instead, CPU core A tries to store (write) the object, what happens?
Would anything nonobvious, interesting and/or different happen if, rather than core A, core B did the loading or storing?
My questions are theoretical and do not refer to any particular CPU architecture, but you may refer to x86 or ARM (or even RISC-V) in your answer if you wish.
Notes. If disregarding snooping would simplify your answer, then you may disregard snooping at your discretion. Alternatively, you may modify the problem if, in your opinion, a modified problem would better illuminate the topic. If you must write code to answer, then I would prefer C/C++. As far as I know, you need not name specific flags of a MESI or MOESI protocol in your answer; a simpler, less detailed answer would probably suffice.
Motive. I ask because I am reading about concurrency and the memory model in the C++ standard. I would like to learn to visualize this model, at least approximately, in terms of hardware operations if possible.
UPDATE
As far as I understand, @HadiBrais advises that the following architecture would be more usual than the one I diagrammed earlier, especially if DDIO (see his answer below) is implemented.
--------------------------------------
| CPU core A | CPU core B | Devices  |
|------------|------------| with DMA |
|  Cache A1  |  Cache B1  |          |
|------------------------------------|
|              Cache 2               |
|------------------------------------|
|                RAM                 |
--------------------------------------