VIPT Cache: Connection between TLB & Cache?

I just want to clarify the concept, and couldn't find detailed enough answers that throw some light on how everything actually works out in the hardware. Please provide any relevant details.

In case of VIPT caches, the memory request is sent in parallel to both the TLB and the Cache.

From the TLB we get the translated physical address. From the cache indexing we get a list of tags (e.g. from all the cache lines belonging to a set).

Then the translated physical address is matched against the list of tags to find a candidate.
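
For concreteness, here is how I picture the bit fields involved. This is just an example geometry (a 32 KiB, 8-way L1 with 64-byte lines and 4 KiB pages); nothing here is specific to any real CPU:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Example geometry only: 32 KiB, 8-way, 64-byte lines, 4 KiB pages. */
#define LINE_SIZE  64u
#define NUM_WAYS   8u
#define CACHE_SIZE (32u * 1024u)
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * NUM_WAYS)) /* 64 sets */
#define PAGE_SIZE  4096u

int main(void) {
    uint64_t vaddr = 0x00007f12345678abULL;              /* some virtual address */

    uint64_t line_offset = vaddr % LINE_SIZE;              /* bits [5:0]  */
    uint64_t set_index   = (vaddr / LINE_SIZE) % NUM_SETS; /* bits [11:6] */
    uint64_t vpn         = vaddr / PAGE_SIZE;              /* bits [63:12]: what the TLB translates */

    /* The index and line-offset bits all sit inside the page offset,
     * so the set can be selected while the TLB translates the VPN: */
    assert(CACHE_SIZE / NUM_WAYS <= PAGE_SIZE);

    printf("set %llu, line offset %llu, VPN %#llx\n",
           (unsigned long long)set_index,
           (unsigned long long)line_offset,
           (unsigned long long)vpn);
    return 0;
}
```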

  • My question is: where is this check performed?
    • In the cache?
    • If not in the cache, where else?
  • If the check is performed in the cache, then:
    • is there a side-band connection from the TLB to the cache module to get the translated physical address needed for comparison with the tag addresses?

Can somebody please throw some light on how this is actually implemented in general, and on the connection between the cache module and the TLB (MMU) module?

I know this depends on the specific architecture and implementation. But what implementation do you know of when there is a VIPT cache?

Thanks.

Gan answered 29/9, 2017 at 0:16 Comment(0)

At this level of detail, you have to break "the cache" and "the TLB" down into their component parts. They're very tightly interconnected in a design that uses the VIPT speed hack of translating in parallel with tag fetch (i.e. taking advantage of the index bits all being below the page offset and thus being translated "for free". Related: Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?)

The L1dTLB itself is a small/fast content-addressable memory with (for example) 64 entries and 4-way set associativity (Intel Skylake). Hugepages are often handled with a second (and third) array checked in parallel, e.g. 32-entry 4-way for 2M pages, and for 1G pages: 4-entry fully (4-way) associative.

But for now, simplify your mental model and forget about hugepages. The L1dTLB is a single CAM, and checking it is a single lookup operation.
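
If it helps to see that in software terms: a lookup in a 64-entry, 4-way L1dTLB (the Skylake-like example above) behaves roughly like the sketch below. This is only an analogy; the struct layout and function name are made up, and real hardware compares all the ways of a set in parallel rather than looping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a 64-entry, 4-way set-associative L1dTLB (16 sets), 4 KiB pages. */
#define TLB_WAYS 4
#define TLB_SETS 16

typedef struct {
    bool     valid;
    uint64_t vpn;   /* virtual page number (the tag) */
    uint64_t pfn;   /* physical frame number         */
} tlb_entry;

static tlb_entry l1dtlb[TLB_SETS][TLB_WAYS];

/* Returns true on hit and writes the physical address; false means a TLB miss
 * (go to the L2 TLB / page walker). */
bool l1dtlb_lookup(uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpn = vaddr >> 12;                    /* strip the 4 KiB page offset */
    unsigned set = (unsigned)(vpn % TLB_SETS);     /* low VPN bits pick the set   */

    for (int way = 0; way < TLB_WAYS; way++) {     /* parallel comparators in HW  */
        if (l1dtlb[set][way].valid && l1dtlb[set][way].vpn == vpn) {
            *paddr = (l1dtlb[set][way].pfn << 12) | (vaddr & 0xFFF);
            return true;
        }
    }
    return false;
}
```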

"The cache" consists of at least these parts:

  • the SRAM array that stores the tags + data in sets
  • control logic to fetch a set of data+tags based on the index bits. (High-performance L1d caches typically fetch data for all ways of the set in parallel with tags, to reduce hit latency vs. waiting until the right tag is selected like you would with larger more highly associative caches.)
  • comparators to check the tags against a translated address, and select the right data if one of them matches, or trigger miss-handling. (And on hit, update the LRU bits to mark this way as Most Recently Used). For a diagram of the basics for a 2-way associative cache without a TLB, see https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec16.pdf#page=17. The = inside a circle is the comparator: producing a boolean true output if the tag-width inputs are equal.
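
Here's a minimal software model of that index / parallel-fetch / tag-compare sequence (the sizes, field names and the plain LRU counter are placeholders, not a real design):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the per-set fetch + tag compare for a 32 KiB, 8-way, 64-byte-line L1d. */
#define WAYS       8
#define SETS       64
#define LINE_BYTES 64

typedef struct {
    bool     valid;
    uint64_t tag;               /* physical address bits above the index  */
    uint8_t  data[LINE_BYTES];
    unsigned lru;               /* placeholder for real replacement state */
} cache_line;

static cache_line l1d[SETS][WAYS];

/* 'ptag' is the translated tag delivered by the L1dTLB; 'set' was selected
 * from the untranslated index bits while the TLB lookup was in flight. */
bool tag_compare(unsigned set, uint64_t ptag, uint8_t out[LINE_BYTES]) {
    for (int way = 0; way < WAYS; way++) {                /* 8 comparators in parallel */
        if (l1d[set][way].valid && l1d[set][way].tag == ptag) {
            memcpy(out, l1d[set][way].data, LINE_BYTES);  /* way-select mux            */
            l1d[set][way].lru = 0;                        /* mark most recently used   */
            return true;                                  /* hit                       */
        }
    }
    return false;                                         /* miss: start miss handling */
}
```

The key point is that the set was chosen from untranslated bits, but the comparators need the physical tag, which is exactly where the TLB output feeds in.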

The L1dTLB is not really separate from the L1D cache. I don't actually design hardware, but I think a load execution unit in a modern high-performance design works something like this:

  • AGU generates an address from register(s) + displacement. (And segment base if non-zero.)

    (Fun fact: Sandybridge-family optimistically shortcuts this process for simple addressing modes: [reg + 0-2047] has 1c lower load-use latency than other addressing modes, if the reg value is in the same 4k page as reg+disp. Is there a penalty when base+offset is in a different page than the base?)

  • The index bits come from the offset-within-page part of the address, so they don't need translating from virtual to physical; for those bits, translation is a no-op. This gives VIPT speed with the non-aliasing behaviour of a PIPT cache, as long as L1_size / associativity <= page_size, e.g. 32kiB / 8-way = 4k pages.

    The index bits select a set. Tags+data are fetched in parallel for all ways of that set. (This costs power to save latency, and is probably only worth it for L1; higher-associativity (more ways per set) L3 caches definitely don't do this.)

  • The high bits of the address are looked up in the L1dTLB CAM array.

  • The tag comparator receives the translated physical-address tag and the fetched tags from that set.

  • If there's a tag match, the cache extracts the right bytes from the data for the way that matched (using the offset-within-line low bits of the address, and the operand-size).

Or instead of fetching the full 64-byte line, it could have used the offset bits earlier to fetch just one (aligned) word from each way. CPUs without efficient unaligned loads are certainly designed this way. I don't know if this is worth doing to save power for simple aligned loads on a CPU which supports unaligned loads.

But modern Intel CPUs (P6 and later) have no penalty for unaligned load uops, even for 32-byte vectors, as long as they don't cross a cache-line boundary. Byte-granularity indexing for 8 ways in parallel probably costs more than just fetching the whole 8 x 64 bytes and setting up the muxing of the output while the fetch+TLB is happening, based on offset-within-line, operand-size, and special attributes like zero- or sign-extension, or broadcast-load. So once the tag-compare is done, the 64 bytes of data from the selected way might just go into an already-configured mux network that grabs the right bytes and broadcasts or sign-extends.

AVX512 CPUs can even do 64-byte full-line loads.
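
A rough software equivalent of that final extraction step, for a scalar load that doesn't split across lines (purely illustrative; the real mux network is configured from these inputs before the tag compare even finishes):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only: pull an operand out of the 64 data bytes of the way that won
 * the tag compare, using offset-within-line, operand size, and the extension attribute.
 * Assumes little-endian (as on x86), size in {1,2,4,8}, and offset + size <= 64
 * (i.e. no line split). */
int64_t extract_operand(const uint8_t line[64], unsigned offset,
                        unsigned size, int sign_extend) {
    uint64_t raw = 0;
    memcpy(&raw, line + offset, size);          /* grab the right bytes              */

    if (sign_extend) {
        unsigned shift = 64 - size * 8;         /* e.g. 32-bit load: shift = 32      */
        return (int64_t)(raw << shift) >> shift; /* replicate the sign bit            */
    }
    return (int64_t)raw;                        /* zero-extended by the zeroed 'raw' */
}
```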


If there's no match in the L1dTLB CAM, the whole cache fetch operation can't continue. I'm not sure exactly how CPUs manage to pipeline this so other loads can keep executing while the TLB miss is resolved; probably the execution unit redoes the load or store-address after the L1dTLB is updated. That process involves checking the L2TLB (Skylake: unified, 1536-entry 12-way for 4k and 2M pages, plus 16 entries for 1G), and if that also misses, doing a page walk.
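
In pseudocode-ish C, the fallback order on a translation miss is something like the sketch below. Every helper name here is a made-up stub just to show the sequencing; none of this corresponds to a real interface, and I'm hand-waving over how the load itself gets replayed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stubs, only to show the fallback order; they don't model real TLB
 * contents or any real API. */
static bool l1dtlb_lookup(uint64_t vaddr, uint64_t *paddr) { (void)vaddr; (void)paddr; return false; }
static bool l2tlb_lookup (uint64_t vaddr, uint64_t *paddr) { (void)vaddr; (void)paddr; return false; }
static uint64_t page_walk(uint64_t vaddr) { return vaddr; }  /* pretend identity mapping */
static void l1dtlb_fill(uint64_t vaddr, uint64_t paddr) { (void)vaddr; (void)paddr; }
static void l2tlb_fill (uint64_t vaddr, uint64_t paddr) { (void)vaddr; (void)paddr; }

/* Order of events on an L1dTLB miss, as described above. */
uint64_t translate(uint64_t vaddr) {
    uint64_t paddr;
    if (l1dtlb_lookup(vaddr, &paddr))        /* normally in parallel with the set fetch */
        return paddr;

    if (!l2tlb_lookup(vaddr, &paddr)) {      /* L2 TLB miss: hardware page walk         */
        paddr = page_walk(vaddr);            /* reads page tables through the L1d       */
        l2tlb_fill(vaddr, paddr);
    }
    l1dtlb_fill(vaddr, paddr);               /* fill the L1dTLB, then replay the load   */
    return paddr;
}
```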

I assume that a TLB miss results in the tag+data fetch being thrown away. They'll be re-fetched once the needed translation is found. There's nowhere to keep them while other loads are running.

At the simplest, it could just re-run the whole operation (including fetching the translation from L1dTLB) when the translation is ready, but it could lower the latency for L2TLB hits by short-cutting the process and using the translation directly instead of putting it into L1dTLB and getting it back out again.

Obviously that requires that the dTLB and L1D are really designed together and tightly integrated. Since they only need to talk to each other, this makes sense. Hardware page walks fetch data through the L1D cache. (Page tables always have known physical addresses to avoid a catch 22 / chicken-egg problem).

On Intel CPUs, load uops apparently leave the scheduler as soon as they're dispatched to an execution unit, so they aren't getting replayed from there. Maybe load execution units or load-buffer entries track pending TLB misses? Load-buffer entries track in-flight cache misses, but in that case address-translation is complete and known to be non-faulting and they just have to forward the data to dependent uops when it arrives.

is there a side-band connection from TLB to the Cache?

I wouldn't call it a side-band connection. The load/store AGUs are the only things that use the L1dTLB. Similarly, L1iTLB is used only by the code-fetch unit which also reads L1i cache. (The page-walk hardware also needs to update TLB entries, perhaps separately from a load or store-address execution-unit. TLB updates can be triggered by hardware prefetch logic, to prefetch the TLB entry for the next virtual page.)

If there's a 2nd-level TLB, it's usually unified, so both the L1iTLB and L1dTLB check it if they miss. Just like split L1I and L1D caches usually check a unified L2 cache if they miss.

Outer caches (L2, L3) are pretty universally PIPT. Translation happens during the L1 check, so physical addresses can be sent to other caches.
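
For contrast with the L1 case: with a made-up 1 MiB, 16-way L2 with 64-byte lines, the set index needs physical address bits above the 4k page offset, so the lookup can't start until translation is done:

```c
#include <stdint.h>

/* Made-up L2 geometry: 1 MiB, 16-way, 64-byte lines => 1024 sets, so the index
 * uses physical address bits [15:6], which reach above the 4 KiB page offset
 * (bits [11:0]). That's why a PIPT cache needs the translated address first. */
#define L2_LINE  64u
#define L2_WAYS  16u
#define L2_SIZE  (1024u * 1024u)
#define L2_SETS  (L2_SIZE / (L2_LINE * L2_WAYS))   /* 1024 sets */

static inline unsigned l2_set_index(uint64_t paddr) {
    return (unsigned)((paddr / L2_LINE) % L2_SETS); /* bits [15:6] of the *physical* address */
}
```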

Insolent answered 29/9, 2017 at 2:4 Comment(21)
caveat: I'm not a real CPU architect, so my understanding might be flawed. Some of the details of my examples might be off. But see realworldtech.com/haswell-cpu/5, and note that the L1dTLB block is stuck to the L1D block, not connected by an arrow like the AGU -> L1D block. David Kanter is a CPU microarchitecture analyst (and his articles on SnB, HSW and Bulldozer are excellent), so this confirms what I'm saying in this answer.Insolent
Supporting unaligned loads can be done in two ways: (1) using multiple aligned loads, or (2) designing a more complex circuit that handles unaligned loads with high perf but at the cost of additional hardware and power overhead. Let's say we want to design a circuit to support aligned 32-byte, 16-byte, 8-byte, and 4-byte loads out of 64-byte cache lines. One possible design is to use 64 simple on/off gates, one for each of the 64 bytes. The enable signal for each gate is calculated using a simple logical combination of the top 4 bits of the 6-bit line offset...Wicket
... For example, to get the top 16 bytes, the enable signal would be calculated as 5th bit AND 4th bit. The other two bits are masked to one using the load-size mask so that the enable signal is turned on for the specified gates. If we want to get the bottom 16 bytes, the enable signal is NOT(5th) AND NOT(4th), and so on. To support unaligned loads efficiently, the calculation of the enable signals is more complicated. To support 2-byte and 1-byte loads, another layer of gates could be added in one possible design (which may incur +1 cycle latency)...Wicket
...Those gates that are enabled output the loaded bits and those gates that are disabled output zeros. Then all the bits get shifted as required and the most significant bit is OR'ed with all higher bits to implement zero and sign extension. Finally, the result is written into the specified register. Other designs are possible, but supporting unaligned loads in hardware is always more complicated.Wicket
In modern processors all TLBs and the page walker have MSHRs similar to the L1D. If a request missed in the TLB, it is aborted and the loads that require that page table entry are all blocked in the load buffer. Later when the TLB is filled, the loads are woken up and replayed from the load buffer.Wicket
@HadiBrais: The description in my answer is an attempt to explain P6 / SnB-family behaviour where unaligned load uops have exactly the same latency as aligned. (Before Nehalem, movups 16-byte loads decoded to more uops. Core2 and P4 are the only uarches with single-uop 16-byte aligned loads; earlier P6-family split into 8-byte halves. But unaligned 8-byte loads that don't split a cache line are still atomic. And I think no latency penalty. There's definitely no latency penalty for unaligned loads on SnB-family.)Insolent
@HadiBrais: Anyway, do you think it's likely that Intel's L1d caches do the necessary decoding to enable/disable bytes for data fetch from all 8 ways in parallel, vs. only doing that decoding for the one way selected by tag comparators? Before SKX / AVX512 that would make the tag-comparator only have to mux at most 32 bytes, but with SKX 64-byte unaligned loads are a thing so any number of bytes from 1 to 64 might be needed from a line for (part of) a load. The design could have changed at any point as vectors widened, though.Insolent
Right. I was only describing how supporting unaligned loads is more complicated in general, not how they are implemented in any particular processor. Note that the question is not specific to any processor.Wicket
I don't know exactly. I think that byte selection and shift logic is after tag comparison. A 64-byte unaligned load crosses at most a single line. Does SKX decode a 64-byte unaligned load into 2 load uops? This would impose a perf penalty by either requiring two load ports and/or using a merge logic. If we want to avoid any performance penalty, then we need to be able to select and handle two lines from within a single port. This would require a lot of additional hardware real estate.Wicket
@HadiBrais: no, vmovups zmm or dqu32 is still a single uop according to Agner Fog, with no penalty if the data is aligned at runtime. I assume it just widens the split-load registers. (The perf counter description for ld_blocks.no_sr describes it as blocked because "all resources for handling the split accesses are in use". software.intel.com/en-us/vtune-amplifier-help-split-loads uses the term "split registers".) I assume the load then works like 2 unaligned loads of arbitrary width from the 2 lines it touches, merging into a split register.Insolent
uops.info shows that IACA models VMOVUPS (ZMM, K, M512) as 2 uops, in contrast to Agner's table. Also Agner's table shows a higher latency for the instruction. Overall, it is not clear to me whether these numbers actually represent the unaligned case.Wicket
@HadiBrais: I was looking at non-masked loads. The p05 ALU uop is obviously for masking. Note that Agner Fog's table has 2 rows: one for no masking (pure load for vmovdq[au]8/16/32/64 v,m), and one with masking (1 micro-fused ALU+load uop for vmovdqu[au]8/16/32/64 v{k},m). Anyway, even in the IACA output, you can see that the extra uop is p05, not p23, so it's not a load uop.Insolent
Section B.5.4.4 of the optimization manual does mention split registers (they exist since Nehalem to handle line split loads and stores). It says they are in the L1D. Note that two consecutive lines (in the same 4K page) are mapped to two consecutive sets, and potentially in different ways. So either two loads ports must be used or one port gets occupied for an additional cycle. I think a split register is used to buffer the two lines while they are being fetched from their respective sets, and then it extracts the required bytes.Wicket
Oh, it looks like uops.info only showed me the masked one and Agner's table only shows the non-masked one and I thought they are the same. But still if I understand Agner's numbers, the latency is higher than normal I think.Wicket
@HadiBrais: no, Agner's tables show both. Do you have the latest version? There are 2 consecutive rows in the SKX table, for yz, m and v{k}, m.Insolent
@HadiBrais: yes, that's similar to what I was picturing split regs being. I just checked, and split-load uops do get replayed. uops_dispatched_port.port_2 + port_3 = 2x number of mov rdi, [rdi] instructions executed. But the counts for uops_issued.any (fused domain issue/rename) and uops_executed.thread (unfused-domain) only count the load once.Insolent
Isn't it possible that the load uop is split in the RS to occupy two RS entries and two load buffer entries? Either way, I think this would cut the throughput in half, if not also increase the latency.Wicket
@HadiBrais: You can't detect a split load until after AGU, which requires the register inputs to be ready (unless it's an absolute or RIP-relative addressing mode). Allocating a 2nd spot in the RS after dispatching the load once and finding it split doesn't make sense, so I don't think this is plausible. We already know that split loads have half throughput and more latency. How can I accurately benchmark unaligned access speed on x86_64. Hopefully if the first line misses in cache, the 2nd line can still start fetching before it arrives?Insolent
I think there is always a penalty when crossing a cache line even for 64-byte loads on SKX. Like I said in my first comment, there are two ways to support unaligned loads, and your test for line-crossing mov rdi, [rdi] confirms that two aligned load requests (counted as two unfused domain uops) are used to implement the unaligned load and a split register is allocated after AGU to handle the merging. There are still two open questions: how many split registers are there, and can the second aligned load be performed even if the first missed? I think the L1 next-line pf would be very useful here.Wicket
@HadiBrais: Yes of course any cache-line split has a penalty. I was talking about each side of a 64-byte unaligned load separately. (And at one point talking about 64-byte aligned loads just to point out that SKX at least does sometimes need to fetch the data for the entire line.) Oh hmm, I wonder if wide vector loads maybe use a different mechanism that doesn't fetch data in parallel with tags? That might explain the 1 cycle higher latency for 256-bit vector loads vs. 128-bit or scalar, on SKL. (But that's store forwarding, not L1d access: agner.org/optimize/blog/read.php?i=415#854)Insolent
@HadiBrais: re: can both halves of a split have outstanding miss requests: I thought about this some more while editing an update on How can I accurately benchmark unaligned access speed on x86_64. If I understand correctly, a cache miss load uop does not itself need to be replayed, just dependent uops. So there should be no obstacle to a split load asking for itself to be replayed to load the other side of the split, whether or not it hit in L1d. The virtual address to load from is not dependent on the load data. So I don't see any obstacle to doing this.Insolent
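
To make the byte-enable scheme described in the comments above concrete, here is a tiny C model for aligned 4/8/16/32-byte loads out of a 64-byte line (the structure and names are illustrative only, not taken from any real design):

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of per-byte enable signals for aligned 4/8/16/32-byte loads out of a
 * 64-byte line: a byte is enabled when its index matches the load's line offset in
 * all offset bits at or above the load size (the lower bits are masked off). */
static void compute_enables(unsigned offset, unsigned size, bool enable[64]) {
    unsigned mask = ~(size - 1u) & 0x3Fu;        /* e.g. size 16 -> compare bits [5:4] only */
    for (unsigned i = 0; i < 64; i++)
        enable[i] = ((i & mask) == (offset & mask));
}

int main(void) {
    bool en[64];
    compute_enables(48, 16, en);                 /* top 16 bytes: 5th bit AND 4th bit */
    for (unsigned i = 0; i < 64; i++)
        putchar(en[i] ? '1' : '0');
    putchar('\n');                               /* prints 48 zeros followed by 16 ones */
    return 0;
}
```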
