Is TLB inclusive?

Is the TLB hierarchy inclusive on modern x86 CPUs (e.g. Skylake, or maybe other Lakes)?

For example, prefetchtN (prefetcht0 / prefetcht1 / prefetcht2) brings data into cache level N + 1 and also brings the corresponding TLB entry into the DTLB. Will that entry be contained in the STLB as well?

Cerous answered 12/4, 2020 at 20:11 Comment(0)

AFAIK, on Intel SnB-family the 2nd-level TLB is a victim cache for the first-level iTLB and dTLB. (I can't find a source for this and IDK where I read it originally, so take this with a grain of salt. I had originally thought this was a well-known fact, but it might have been a misconception I invented!)

I thought this was documented somewhere in Intel's optimization manual, but it doesn't seem to be.

If this is correct, you get basically the same benefit of hitting in STLB some time later after the entry has been evicted from dTLB, but without wasting space on duplicate entries.

So for example if you keep code and data in the same page, you could get an iTLB miss when executing the code, and then, if that code loads data from the same page, a dTLB miss that also misses in the STLB and does another page walk. (That's one reason we don't keep read-only data in the same page as code on x86; it has no code-size advantage and wastes iTLB + dTLB coverage footprint by having the same page in both TLBs.)


But perhaps I'm wrong; Travis (@BeeOnRope) suggested using data prefetch to reduce iTLB miss cost; he's assuming that the page walker fills an entry in STLB and dTLB. (On Core 2(?) and later, TLB-miss software-prefetch can trigger a walk instead of giving up.)

"I think L2 prefetching is likely to be very effective for code that would otherwise miss to DRAM. Yes, you don't warm the ITLB or the L1I, but you warm the L2 and STLB, so you are taking something like a dozen cycles for the first execution."
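A minimal sketch of what that suggestion could look like in C, under some assumptions: prefetch_range_l2 is a hypothetical helper name, the 64-byte line size is assumed, and __builtin_prefetch with locality 2 is assumed to compile to prefetcht1 (GCC and Clang typically do this on x86, but check the generated asm). Whether the resulting page walk also fills the STLB is exactly the open question here.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size on current x86 parts */

/* Hypothetical helper: issue a data prefetch for every cache line in
 * [start, start + len). Returns the number of prefetches issued. */
static size_t prefetch_range_l2(const void *start, size_t len)
{
    assert(start != NULL);
    const unsigned char *p = (const unsigned char *)start;
    size_t issued = 0;

    for (size_t off = 0; off < len; off += CACHE_LINE) {
        /* rw = 0 (read), locality = 2: assumed to become prefetcht1,
         * i.e. a prefetch into L2 that skips L1d. */
        __builtin_prefetch(p + off, 0, 2);
        issued++;
    }
    return issued;
}
```

To warm code rather than data you would pass the address of the cold function, e.g. prefetch_range_l2((const void *)(uintptr_t)&cold_function, approx_size). Converting a function pointer to an object pointer is implementation-defined in ISO C, but it works on mainstream x86 ABIs. Prefetches are hints, so the only way to know whether this helps is to measure (e.g. with the page-walk performance counters).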

This would work for a NINE STLB; it doesn't have to actually be inclusive, just not exclusive or a victim cache. (e.g. L2 cache is NINE wrt. L1i cache and L1d cache. They fetch through it, but lines can be evicted from L2 without forcing eviction from either L1 cache.)
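To make the difference between the two hypotheses concrete, here is a toy model in C. It is only a sketch: the sizes, the FIFO replacement and all names (struct mmu, mmu_access, walks_for_shared_page) are made up for illustration and say nothing about how real hardware is organized. It counts page walks when the same page is touched first by instruction fetch and then by a data load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum policy { STLB_VICTIM, STLB_NINE };

#define L1_ENTRIES 2   /* toy sizes, not real hardware values */
#define L2_ENTRIES 8

struct tlb { unsigned long page[L2_ENTRIES]; size_t n, cap; };
struct mmu { enum policy policy; struct tlb itlb, dtlb, stlb; unsigned walks; };

static bool tlb_lookup(const struct tlb *t, unsigned long page)
{
    for (size_t i = 0; i < t->n; i++)
        if (t->page[i] == page)
            return true;
    return false;
}

static void tlb_remove(struct tlb *t, unsigned long page)
{
    for (size_t i = 0; i < t->n; i++) {
        if (t->page[i] != page)
            continue;
        for (size_t j = i + 1; j < t->n; j++)
            t->page[j - 1] = t->page[j];
        t->n--;
        return;
    }
}

/* FIFO insert; returns true and sets *victim if an entry was evicted. */
static bool tlb_insert(struct tlb *t, unsigned long page, unsigned long *victim)
{
    bool evicted = false;
    assert(t->cap > 0 && t->cap <= L2_ENTRIES);
    if (tlb_lookup(t, page))
        return false;
    if (t->n == t->cap) {
        *victim = t->page[0];
        tlb_remove(t, *victim);
        evicted = true;
    }
    t->page[t->n++] = page;
    return evicted;
}

static void mmu_init(struct mmu *m, enum policy p)
{
    *m = (struct mmu){ .policy = p };
    m->itlb.cap = m->dtlb.cap = L1_ENTRIES;
    m->stlb.cap = L2_ENTRIES;
}

/* One translation request from the given first-level TLB. */
static void mmu_access(struct mmu *m, struct tlb *l1, unsigned long page)
{
    unsigned long victim = 0, dropped = 0;
    if (tlb_lookup(l1, page))
        return;                                   /* first-level hit */
    bool l2_hit = tlb_lookup(&m->stlb, page);
    if (!l2_hit)
        m->walks++;                               /* page walk */
    if (m->policy == STLB_VICTIM) {
        if (l2_hit)
            tlb_remove(&m->stlb, page);           /* entry moves up */
        if (tlb_insert(l1, page, &victim))
            tlb_insert(&m->stlb, victim, &dropped); /* L1 victim goes to STLB */
    } else {
        if (!l2_hit)
            tlb_insert(&m->stlb, page, &dropped); /* walk fills STLB too */
        tlb_insert(l1, page, &victim);            /* L1 victim is just dropped */
    }
}

/* Same page fetched as code, then loaded as data. */
static unsigned walks_for_shared_page(enum policy p)
{
    struct mmu m;
    mmu_init(&m, p);
    mmu_access(&m, &m.itlb, 42);
    mmu_access(&m, &m.dtlb, 42);
    return m.walks;
}
```

In this model walks_for_shared_page(STLB_VICTIM) does two walks and walks_for_shared_page(STLB_NINE) does one. Swapping the order of the two accesses (data first, then code) gives the same counts, which is why the prefetch trick needs the walk to fill the STLB.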


Further details with links to source:


Core 2 was different: https://www.realworldtech.com/nehalem/8/ says that it has a tiny 16-entry L1 dTLB used only for loads, and uses the L2 DTLB for stores as well as for loads that miss in the L1 dTLB.

Nehalem changed that (64-entry DTLB) along with reorganizing the memory hierarchy to what's still used on client (non-server) chips: large shared inclusive LLC and 256k private L2. (And of course still the usual split 32k L1i/d.) See also: Which cache mapping technique is used in intel core i7 processor?

Delicacy answered 12/4, 2020 at 20:22 Comment(6)
Unfortunately, it's not documented in Intel's optimization manual. At least searching for the keywords "victim" and "inclusive" did not give any results related to the TLB. How did you discover the TLB behavior? Was it some personal research? Actually, I discovered a new thing that is not really related to the topic: LLC is non-inclusive since Skylake and a victim for the mid-level cache. – Cerous
The original problem I was trying to solve was prefetching code to L2 with prefetcht1. There is a topic on Intel's official forum describing exactly that. Here is what Travis D. wrote: "I think L2 prefetching is likely to be very effective for code that would otherwise miss to DRAM. Yes, you don't warm the ITLB or the L1I, but you warm the L2 and STLB, so you are taking something like a dozen cycles for the first execution." – Cerous
Having said that, your proposition that "on Intel SnB-family 2nd-level TLB is a victim cache for first-level iTLB and dTLB" is not obvious to me and would require some proof. – Cerous
@SomeName: I had thought it was a well-known fact, but since you pointed it out I didn't find it in Intel's optimization manual either. I searched on "DTLB" and "STLB" in case they describe eviction without using the word "victim". Now I'm searching to find out where I read that. It wasn't my own experimental testing. Maybe Agner Fog? Checking that now. Oh and BTW, LLC on Skylake-client (dual / quad cores like i7-6700k) is still inclusive, and they still use the same ring bus architecture. Only Skylake-server (with AVX512) uses a mesh and NINE LLC. – Delicacy
@SomeName: I still haven't found anything; updated my answer to add caveats. Note that STLB wouldn't have to be inclusive for prefetcht1 to work; NINE would be fine, too. (See updated answer.) – Delicacy
Seems that the STLB is indeed non-inclusive and non-exclusive, and not a victim cache: usenix.org/system/files/sec22fall_tatar.pdf (Table 1) – Drinker
