How do data caches route the object in this example?
Consider the diagrammed data cache architecture. (ASCII art follows.)

  --------------------------------------
  | CPU core A | CPU core B |          |
  |------------|------------| Devices  |
  |  Cache A1  |  Cache B1  | with DMA |
  |-------------------------|          |
  |         Cache 2         |          |
  |------------------------------------|
  |                RAM                 |
  --------------------------------------

Suppose that

  • an object is shadowed on a dirty line of Cache A1,
  • an older version of the same object is shadowed on a clean line of Cache 2, and
  • the newest version of the same object has recently been written to RAM via DMA.

Diagram:

  --------------------------------------
  | CPU core A | CPU core B |          |
  |------------|------------| Devices  |
  |  (dirty)   |            | with DMA |
  |-------------------------|          |
  |     (older, clean)      |          |
  |------------------------------------|
  |          (newest, via DMA)         |
  --------------------------------------

Three questions, please.

  1. If CPU core A tries to load (read) the object, what happens?

  2. If, instead, CPU core A tries to store (write) the object, what happens?

  3. Would anything nonobvious, interesting and/or different happen if, rather than core A, core B did the loading or storing?

My questions are theoretical. My questions do not refer to any particular CPU architecture but you may refer to x86 or ARM (or even RISC-V) in your answer if you wish.

Notes. If disregarding snooping would simplify your answer then you may disregard snooping at your discretion. Alternately, you may modify the problem if a modified problem would better illuminate the topic in your opinion. If you must write code to answer, then I would prefer C/C++. You need not name specific flags of a MESI or MOESI protocol in your answer as far as I know, but a simpler, less detailed answer would probably suffice.

Motive. My motive to ask is that I am reading about concurrency and the memory model in the C++ standard. I would like to learn to visualize this model approximately in terms of hardware operations if possible.

UPDATE

To the extent to which I understand, @HadiBrais advises that the following diagrammed architecture would be more usual than the one I have earlier diagrammed, especially if DDIO (see his answer below) is implemented.

  --------------------------------------
  | CPU core A | CPU core B | Devices  |
  |------------|------------| with DMA |
  |  Cache A1  |  Cache B1  |          |
  |------------------------------------|
  |              Cache 2               |
  |------------------------------------|
  |                RAM                 |
  --------------------------------------
Aguedaaguero answered 11/2, 2019 at 19:9 Comment(4)
What do you mean by "shadowed"? Also, your example suggests that you are assuming that the DMA is non-coherent. Just to be clear, is this intentional? – Panathenaea
@HadiBrais To clarify the level from which the question is pitched: I am a Debian Developer whose education is in electrical engineering. My professional field is not computers but building construction, so I am mostly self-trained in computer matters. If "shadowed" is the wrong word, please do correct me! Regarding DMA, I had indeed been assuming that the DMA is non-coherent, but PCI and such are not well known to me. I mentioned DMA only to simplify the problem. (Otherwise, the problem might have needed four CPU cores and a three-level cache, complicating it unnecessarily.) – Aguedaaguero
Specifically to answer your first question, by "to shadow X," I mean "temporarily to keep an imperfectly synchronized, possibly modifiable copy of X for local use and, if necessary, for later flushing." – Aguedaaguero
Modern x86 has cache-coherent DMA. I think this became a thing when x86 CPUs started putting the memory controller on-chip, so snooping the L3 tags on the way to memory became practical (and Intel's inclusive L3 tags work as a snoop filter for the private per-core caches). Having coherent DMA probably makes aggressive hardware prefetch easier to implement without worrying about creating decoherence in odd corner cases of branch mispredicts leading to unwanted speculative loads from just-flushed memory. – Mattah

Your hypothetical system seems to include coherent, write-back L1 caches and non-coherent DMA. A very similar real processor is ARM11 MPCore, except that it doesn't have an L2 cache. However, most modern processors do have coherent DMA. Otherwise, it is the software's responsibility to ensure coherence. The state of the system as shown in your diagram is already incoherent.

If CPU core A tries to load (read) the object, what happens?

It will just read the line held in its local L1 cache. No changes will occur.

If, instead, CPU core A tries to store (write) the object, what happens?

The line is already in the M coherence state in the L1 cache of core A, so core A can write to it directly. No changes will occur.

Would anything nonobvious, interesting and/or different happen if, rather than core A, core B did the loading or storing?

If core B issued a load request to the same line, the L1 cache of core A is snooped and the line is found in the M state. The line is updated in the L2 cache and is sent to the L1 cache of core B. Also one of the following will occur:

  • The line is invalidated from core A's L1 cache. The line is inserted in core B's L1 cache in the E coherence state (in case of the MESI protocol) or the S coherence state (in case of the MSI protocol). If the L2 uses a snoop filter, the filter is updated to indicate that core B has the line in the E/S state. Otherwise, the state of the line in the L2 will be the same as that in core B's L1, except that the L2 doesn't know the line is there (so snoops will always have to be broadcast).
  • The state of the line in core A's L1 cache is changed to S. The line is inserted in core B's L1 cache in the S coherence state. The L2 inserts the line in the S state.

Either way, both L1 caches and the L2 cache will all hold the same copy of the line, which remains incoherent with the copy in main memory.

If core B issued a store request to the same line, the line will be invalidated from core A's cache and will end up in the M state in core B's cache.

Eventually, the line will be evicted from the cache hierarchy to make space for other lines. When that happens, there are two cases:

  • The line is in the S/E state, so it will simply be dropped from all caches. Later, if the line is read again, the copy written by the DMA operation will be read from main memory.
  • The line is in the M state, so it will be written back to main memory and (potentially partially) overwrite the copy written by the DMA operation.

Obviously, such an incoherent state must never occur. It can be prevented by invalidating all relevant lines from all caches before the DMA write operation begins, and by ensuring that no core accesses the area of memory being written to until the operation finishes. The DMA controller sends an interrupt whenever an operation completes. In the case of a DMA read operation, all the relevant lines need to be written back to memory first to ensure that the most recent values are used.

Intel Data Direct I/O (DDIO) technology enables the DMA controller to read or write directly from the shared last-level cache to improve performance.


This section is not directly related to the question, but I want to write this somewhere.

All commercial x86 CPUs are fully cache coherent (i.e., the whole cache hierarchy is coherent). To be more precise, all processors within the same shared memory domain are cache coherent. In addition, all commercial x86 manycore coprocessors (i.e., Intel Xeon Phi in the PCIe card form) are internally fully coherent. A coprocessor, which is a device on the PCIe interconnect, is not coherent with other coprocessors or CPUs. So a coprocessor resides in a separate coherence domain of its own. I think this is because there is no built-in hardware mechanism to make a PCIe device that has a cache coherent with other PCIe devices or CPUs.

Other than commercial x86 chips, there are prototype x86 chips that are not cache coherent. The only example I'm aware of is Intel's Single-Chip Cloud Computer (SCC), which later evolved into the coherent Xeon Phi.

Panathenaea answered 12/2, 2019 at 17:41 Comment(8)
+1 You have illuminated misconceptions of mine. The illumination is appreciated and helpful. I had not known that my system had started in a disallowed state! – Aguedaaguero
@thb: having an (older, clean) copy in L2 is impossible if it's dirty in an L1d. As Hadi says, this is already incoherent, and not just with RAM but also between levels of cache. MESI requires that if a line is in the Modified (dirty) state in any cache, all other caches have it Invalid. (Usually a line isn't written back until it's evicted from L1, so we end up with Modified in L2 and Invalid in other L1s. But there's a mechanism for inner caches to read copies of the L2's dirty line before it's actually written back.) – Mattah
Pure original MESI is about maintaining coherence of separate caches (e.g. multiple cores each with their own private caches), not a hierarchy with a shared last-level cache. This is why it talks about snooping a memory bus. (I was going to post the previous comment on the question, but put it here in case Hadi wants to update the answer to address this problem with the initial state, which goes beyond DMA non-coherence into invalid MESI states, I think.) – Mattah
@PeterCordes I've interpreted (older, clean) as meaning the line exists in the E state. An implementation can check that if the L2 has the line in the E or M state, then it must snoop all private caches. In another implementation, MESI could be implemented only at the L1 and SI at the inclusive L2. In this case, if any core misses in its private cache and hits in the L2, it always has to snoop all the private caches. This is less efficient but requires less hardware at the L2. If the L2 is not inclusive, snoops have to be sent whether there is a hit or miss in the L2... – Panathenaea
...The MESI protocol basically only matters when you hit in the L1. If the OP intended to have MESI at the L2 and meant that the line is in the S state there, then it is indeed incoherent, because another core could read the stale version in the L2 (on top of the incoherent DMA). – Panathenaea
Why is the system supposed to have coherent caches? It is possible to have a non-coherent common L2 between cores, isn't it? Intel SCC is something of that kind (though it doesn't have a common cache between cores). – Trypanosomiasis
@MargaretBloom It is certainly possible to have non-coherent caches. There are real processors with a non-coherent L1 (but the L2 is shared and hence necessarily coherent), though I don't remember their names. That said, I don't think the question would be interesting if the whole system were non-coherent, because in such a system the behavior would be basically undefined if it ever reached an incoherent state (i.e., when two cores are using the same line and at least one of them is writing to it). The system would just be incoherent and there would be no guarantees about what value ends up in memory. – Panathenaea
Interesting example of Xeon Phi cards in a regular system. That's a good example for "Is mov + mfence safe on NUMA?", where the answer revolves around the fact that all x86 multi-core systems are cache-coherent, or else they can't be a single system image that runs normal x86 software. I added a paragraph there with a Xeon Phi example linking here. (Ping @MargaretBloom) – Mattah

© 2022 - 2024 — McMap. All rights reserved.