How does the VIPT to PIPT conversion work on L1->L2 eviction

This scenario came into my head and it seems a bit basic but I'll ask.

So there is a virtual index and physical tag in L1, but the set becomes full so a line is evicted. How does the L1 controller get the full physical address from the virtual index and the physical tag in L1 so the line can be inserted into L2? I suppose it could search the TLB for the combination, but that seems slow, and the translation may not be in the TLB at all. Perhaps the full physical address from the original TLB translation is stored in the L1 next to the cache line?

This also opens the wider question of how the PMH invalidates the L1 entry when it writes accessed bits to the PTEs, PDEs and so on. My understanding is that it interfaces with the L2 cache directly using physical addresses, but when it writes accessed and modified bits (as well as sending an RFO if it needs to), it would have to reflect the change in the copy in the L1 if there is one, meaning it would have to know the virtual index corresponding to the physical address. In that case, if the full physical address were also stored in the L1, it would give the L2 a way to index it as well.

Cominform answered 27/3, 2019 at 23:9

Yes, outer caches are (almost?) always PIPT, and memory itself obviously needs the physical address. So you need the physical address of a line when you send it out into the memory hierarchy.


In Intel CPUs, the VIPT L1 caches have all the index bits from the offset-within-page part of the address, so virt=phys, avoiding any aliasing problems. It's basically PIPT, but still able to fetch data/tags from the set in parallel with the TLB lookup for the page-number bits to create an input for the tag comparator.

The full physical address is known just from L1d index + tag, again because it behaves like a PIPT for everything except load latency.
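To make the bit arithmetic concrete, here is a minimal C sketch assuming the common 32 KiB / 8-way / 64 B-line L1d geometry with 4 KiB pages (64 sets, so index bits 11:6 sit entirely inside the page offset). The constant and function names are just illustrative:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 32 KiB, 8-way, 64 B lines -> 64 sets (typical Intel L1d). */
#define LINE_BYTES  64u
#define NUM_SETS    64u
#define OFFSET_BITS 6u    /* log2(LINE_BYTES) */
#define INDEX_BITS  6u    /* log2(NUM_SETS)   */
#define PAGE_BITS   12u   /* 4 KiB pages      */

/* Set index comes from the virtual address, available before the TLB answers. */
static uint32_t l1_set_index(uint64_t vaddr) {
    return (uint32_t)((vaddr >> OFFSET_BITS) & (NUM_SETS - 1));
}

/* Tag stored alongside the line: every physical-address bit above the page offset. */
static uint64_t l1_tag(uint64_t paddr) {
    return paddr >> PAGE_BITS;
}

/* Eviction: rebuild the physical line address from tag + set index alone.
 * This works only because OFFSET_BITS + INDEX_BITS == PAGE_BITS, i.e. the
 * index lies entirely inside the page offset, so virtual index == physical index. */
static uint64_t evicted_line_paddr(uint64_t tag, uint32_t set) {
    return (tag << PAGE_BITS) | ((uint64_t)set << OFFSET_BITS);
}

int main(void) {
    assert(OFFSET_BITS + INDEX_BITS == PAGE_BITS);  /* the "behaves like PIPT" condition */

    uint64_t vaddr = 0x00007f1234567abcULL;  /* arbitrary virtual address                 */
    uint64_t paddr = 0x000000018899aabcULL;  /* its made-up translation: same low 12 bits */

    uint64_t rebuilt = evicted_line_paddr(l1_tag(paddr), l1_set_index(vaddr));
    printf("rebuilt %#llx, expected %#llx\n",
           (unsigned long long)rebuilt,
           (unsigned long long)(paddr & ~(uint64_t)(LINE_BYTES - 1)));
    return 0;
}
```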


In the general case of virtually-indexed caches where some of the index bits do come from the page-number, that's a good question. Such systems do exist, and page-colouring is often used by the OS to avoid aliasing. (So they don't need to flush the cache on context switches.)

The question Virtually indexed physically tagged cache Synonym has a diagram for one such VIPT L1d: the physical tag is extended a few bits to come all the way down to the page offset, overlapping the top index bit.

Good observation that a write-back cache needs to be able to evict dirty lines long after the TLB check for the store was done. Unlike a load, you don't still have the TLB result floating around unless you stored it somewhere.

Having the tag include all the physical address bits above the page offset (even if that overlaps some index bits) solves this problem.
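Here is the same reconstruction sketched in C for that overlapping-tag layout, with made-up numbers (64 B lines, 256 sets, so two index bits sit above the 4 KiB page boundary). On eviction the index bits above bit 11 are simply ignored, because the tag already holds the physical versions of those bits:

```c
#include <stdint.h>
#include <stdio.h>

/* Made-up VIPT geometry where the index spills above the page offset:
 * 64 B lines (6 offset bits) and 256 sets (8 index bits), so 6 + 8 = 14 > 12. */
#define OFFSET_BITS 6u
#define INDEX_BITS  8u
#define PAGE_BITS   12u

/* Tag stores every physical-address bit from bit 12 upward, even though the
 * top two index bits overlap that range. */
static uint64_t tag_of(uint64_t paddr) { return paddr >> PAGE_BITS; }

/* On eviction, trust the virtual index only for bits below the page boundary;
 * the overlapping bits come from the (physical) tag instead. */
static uint64_t evicted_line_paddr(uint64_t tag, uint32_t set) {
    uint64_t below_page = ((uint64_t)set << OFFSET_BITS) & (((uint64_t)1 << PAGE_BITS) - 1);
    return (tag << PAGE_BITS) | below_page;   /* line-aligned physical address */
}

int main(void) {
    uint64_t paddr = 0x000000018899aabcULL;   /* made-up physical address             */
    uint64_t vaddr = 0x00007f12345b7abcULL;   /* maps to it, so the low 12 bits match */

    uint32_t set = (uint32_t)((vaddr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1));

    printf("rebuilt %#llx, expected %#llx\n",
           (unsigned long long)evicted_line_paddr(tag_of(paddr), set),
           (unsigned long long)(paddr & ~0x3fULL));
    return 0;
}
```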

Another solution would be a write-through cache, so you do always have the physical address from the TLB to send with the data, even if it's not reconstructable from the cache tag+index. Or for read-only caches, e.g. instruction caches, being virtual isn't a problem.


But I don't think a TLB check at eviction could solve the problem for the non-overlapping tag case: you don't have the full virtual address anymore; only the low bits of your page number are virtual (from the index), and the rest are physical (from the tag). So that isn't a valid input to the TLB.

So besides being inefficient, there's also the equally important problem that it wouldn't work at all. :P Maybe there's some trick I don't know or something I'm missing, but I don't think even a special TLB indexed both ways (phys->virt and virt->phys) could work, because multiple mappings of the same physical page are allowed.
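As a sanity check on that argument, here is the bit accounting for such a hypothetical non-overlapping layout (made-up sizes: 6 offset bits, 8 index bits, so the stored tag only begins at bit 14), showing that neither a complete virtual nor a complete physical page number is available at eviction time:

```c
#include <stdio.h>

/* Hypothetical non-overlapping layout: the tag begins where the index ends. */
enum { OFFSET_BITS = 6, INDEX_BITS = 8, PAGE_BITS = 12 };

int main(void) {
    int tag_start = OFFSET_BITS + INDEX_BITS;   /* first bit covered by the tag = 14 */

    /* What the cache controller holds for a dirty line at eviction time: */
    printf("bits [%d..]    physical, from the stored tag\n", tag_start);
    printf("bits [%d..%d]  VIRTUAL page-number bits, from the set index\n",
           PAGE_BITS, tag_start - 1);
    printf("bits [0..%d]   page offset, identical in both address spaces\n",
           PAGE_BITS - 1);

    /* A TLB lookup would need the complete virtual page number (bits [12..]),
     * but bits [14..] here are physical, so there is nothing valid to look up. */
    return 0;
}
```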


I think real CPUs that have used VIVT caches have normally had them as write-through. I don't know the history well enough to say for sure or cite any examples. I don't see how they could be write-back, unless they stored two tags (physical and virtual) for every line.

I think early RISC CPUs often had 8k direct-mapped caches.

But first-gen classic 5-stage MIPS R2000 (using external SRAM for its L1) apparently had a PIPT write-back cache, if the diagram in these slides labeled MIPS R2000 is right, showing a 14-bit cache index taking some bits from the physical page number of the TLB result. But it still works with 2 cycle latency for loads (1 for address-generation in the EX stage, 1 for cache access in the MEM stage).

Clock speeds were much lower in those days, and caches+TLBs have gotten larger. I guess back then a 32-bit binary adder in the ALU did have comparable latency to TLB + cache access, maybe not using as aggressive carry-lookahead or carry-select designs.

A MIPS R4300i datasheet (the variant of the MIPS R4200 used in the Nintendo 64) shows what happens where/when in its 5-stage pipeline, with some things happening on the rising or falling edge of the clock, letting it divide some things up into half-clocks within a stage. (So e.g. forwarding can work from the first half of one stage to the 2nd half of another, e.g. for branch target -> instruction fetch, still without needing extra latching between half-stages.)

Anyway, it shows DVA (data virtual address) calculation happening in EX: that's the register + imm16 from a lw $t0, 1234($t1). Then DTLB and DCR (data-cache read) happen in parallel in the first half of the Data Cache stage. (So this is a VIPT). DTC (Data Tag Check) and LA (load alignment e.g. shifting for LWL / LWR, or for LBU to extract a byte from a fetched word) happen in parallel in the 2nd half of the stage.

So I still haven't found confirmation of a single-cycle (after address calculation) PIPT MIPS. But this is definite confirmation that single-cycle VIPT was a thing. From Wikipedia, we know that its D-cache was 8-kiB direct-mapped write-back.

Sidesman answered 28/3, 2019 at 3:41 Comment(16)
In a VIPT cache, by definition, at least the whole physical page number (and maybe some page-offset bits, depending on the cache index size) is the physical tag, which is stored in the cache. The physical address of the cache line is reconstructed by appending the low bits of the virtual index to the physical tag. The Intel Itanium-2 uses a similar design called prevalidated tags, wherein a pointer into the TLB is used in the L1 array instead of the physical tag itself. This saves area because ...Recept
...the size of the pointer is smaller, and it also reduces hit access latency because there is no need to actually obtain and compare the physical tags from the TLB, only the TLB pointers. However, this introduces additional complexity because the TLB pointers need to be maintained. Also, the actual physical tag needs to be obtained from the TLB when the cache line needs to be evicted from the L1 cache.Recept
See: Itanium 2 Processor Microarchitecture.Recept
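(A rough C sketch of the prevalidated-tag idea above, with toy sizes and made-up names, not the actual Itanium-2 structures: a hit compares a small TLB pointer, while an eviction has to go back to the pointed-to TLB entry for the real physical tag.)

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy sizes and names; the real Itanium-2 structures are more involved. */
#define TLB_ENTRIES 16u
#define PAGE_BITS   12u

struct tlb_entry { uint64_t vpn, pfn; bool valid; };
struct line_meta { uint8_t tlb_idx; bool valid; };  /* "prevalidated tag": a pointer into the TLB */

static struct tlb_entry tlb[TLB_ENTRIES];

/* Hit check: compare the small TLB pointer, not a full physical tag. */
static bool l1_hit(const struct line_meta *line, uint8_t translating_entry) {
    return line->valid && line->tlb_idx == translating_entry;
}

/* Eviction: fetch the full physical tag back from the TLB entry the line points at.
 * If that TLB entry has been replaced, the line must already have been invalidated
 * (the extra bookkeeping mentioned above). */
static uint64_t evicted_phys_page(const struct line_meta *line) {
    return tlb[line->tlb_idx].pfn;
}

int main(void) {
    tlb[3] = (struct tlb_entry){ .vpn = 0x7f123, .pfn = 0x18899, .valid = true };
    struct line_meta line = { .tlb_idx = 3, .valid = true };

    printf("hit via TLB slot 3? %d\n", l1_hit(&line, 3));
    printf("physical page address on eviction: %#llx\n",
           (unsigned long long)(evicted_phys_page(&line) << PAGE_BITS));
    return 0;
}
```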
@HadiBrais: By definition? I thought the definition was that any bits that are in the tag are from the physical address. So you're claiming that a read-only or write-through cache where the tag bits come only from the physical address, but don't include the whole physical address, couldn't be called VIPT? (e.g. like the example I linked, but if the tag ended 1 bit higher so it didn't overlap with the index.) But thanks for the Itanium-2 info, that's an interesting idea. So TLB eviction might have to scan L1d to avoid orphaning a line...Sidesman
If the tag of a VIPT cache does not include all of the PFN bits, then I don't think the cache would even work, because it's not sufficient to compare only a subset of the PFN bits and the virtual index to identify a cache line, even within a single address space. The example you linked does seem to show that the whole PFN is the tag (which is sufficient because bits 0-11 of the physical address are fully included in the cache line offset and part of the virtual index). Although I'm not sure why the tag is 28 bits in that example.Recept
@HadiBrais: Oh yes, I keep forgetting about the problem of mixing virt and phys for different parts of the address. I only noticed that for the TLB-lookup idea near the end of writing this answer. You're right, if we can't depend on the OS doing page-colouring to avoid synonym problems, (effectively making more address bits translate for free), it would have something like a homonym problem but without even having to change the page table to trigger it. Merely having two adjacent physical pages (PFN differs only in the low bit) mapped to virtual pages that alias the same set would do it.Sidesman
re: why the tag is 28 bits: it's from cse.unsw.edu.au/~cs9242/02/lectures/03-cache/node8.html, for a MIPS R4x00, so presumably 28+12 = 40 is the physical address width on that 64-bit MIPS uarch.Sidesman
According to Chapter 4 of the manual, the size of the virtual address is 40 bits and the size of the physical address is 36 bits. Also Chapter 11 shows that the size of the tag is 24 bits, which is exactly what it should be. So that figure is wrong.Recept
@HadiBrais: I thought 40-bit physical sounded big for CPUs that old. I wonder where that diagram is actually from, then. Fortunately it doesn't really matter, it's still a valid example of at least a hypothetical system.Sidesman
Yeah me too. I tried to search using Google Image, but got nothing.Recept
In the post you linked, the other answer has correctly pointed out that particular error, but the answer got, very unfortunately, deleted. Argh!Recept
@HadiBrais: Oh yes, I'd forgotten the diagram showed the physical address width, too, so the diagram is even self-contradictory. I left a comment on the other question to point it out.Sidesman
@PeterCordes I found a diagram on Google for a 32-byte-line VIPT cache m.imgur.com/gallery/CmTZ6dC that shows that bits 11-6 of the VA are the same as in the PA and appear in the tag; this makes sense because they're the same offset within the page, so of course they're going to be the same. The physical address is known from Tag (PFN + virtual index (same as physical index)) + Offset, so it makes sense to me now. Page colouring is hence not made more complicated and can still be done based on PFN. I think a process has a current colour and each page allocation cycles through the colours.Cominform
Actually, no. Only part of the virtual index is stored in the tag, which is enough to know the full physical address. This means that it would affect colouring, because the colouring algorithm would have to take those 2 bits of the VFN into consideration as well.Cominform
I think it depends on what you want to optimise for. Optimising for the PIPT lower-level caches requires each physical page to be assigned a colour, with an incrementing colour selected for each fault in the process. For VIPT the optimisation isn't done when actually allocating a physical page for the virtual address, but rather in the selection of the virtual addresses to use within the process's own address space in the first place.Cominform
@LewisKelsey: page colouring is normally just a term for matching a physical address bit with a virtual address bit, so more bits effectively translate for free. So you don't have to "assign" a colour to a physical page, its address determines the colour. (Other than that, I didn't really take the time to follow your comments, sorry. I'm not that interested in theoretical cache-design possibilities like this that aren't used on any CPUs I care about, at least not right now.)Sidesman
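(For what that bit-matching definition of page colouring looks like concretely, here is a tiny C sketch with made-up numbers: 2 cache-index bits above a 4 KiB page, hence 4 colours. The OS rule is simply that the colour bits of the virtual and physical page addresses agree, so those bits effectively translate for free.)

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Made-up geometry: 2 cache-index bits above the 4 KiB page offset -> 4 colours. */
enum { PAGE_BITS = 12, COLOUR_BITS = 2 };

static unsigned colour_of(uint64_t addr) {
    return (unsigned)((addr >> PAGE_BITS) & ((1u << COLOUR_BITS) - 1));
}

/* The page-colouring constraint: only map a virtual page to a physical page of
 * the same colour, so the cache set is identical whether the index is computed
 * from the virtual or the physical address. */
static bool mapping_allowed(uint64_t vaddr, uint64_t paddr) {
    return colour_of(vaddr) == colour_of(paddr);
}

int main(void) {
    printf("%d\n", mapping_allowed(0x7f0003000ULL, 0x18899b000ULL));  /* colours 3 == 3 -> 1 */
    printf("%d\n", mapping_allowed(0x7f0003000ULL, 0x18899c000ULL));  /* colours 3 != 0 -> 0 */
    return 0;
}
```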
