Is the TLB hierarchy inclusive on modern x86 CPUs (e.g. Skylake, or maybe other Lakes)?
For example, prefetchtn brings data into cache level n+1, as well as a corresponding TLB entry into the DTLB. Will that entry be contained in the STLB as well?
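For concreteness, here is what I mean, via the _mm_prefetch intrinsic (a minimal sketch; the hint-to-level mapping in the comments is my reading of Intel's docs):

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_* */

void touch(const char *p)
{
    _mm_prefetch(p, _MM_HINT_T0);  /* prefetcht0: into L1 and all outer levels */
    _mm_prefetch(p, _MM_HINT_T1);  /* prefetcht1: into L2 and outer levels     */
    _mm_prefetch(p, _MM_HINT_T2);  /* prefetcht2: into L3 (and outer, if any)  */
}
```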
AFAIK, on Intel SnB-family CPUs the 2nd-level TLB is a victim cache for the first-level iTLB and dTLB. (I can't find a source for this, and IDK where I read it originally, so take it with a grain of salt. I had originally thought this was a well-known fact, but it might be a misconception I invented!)
I thought this was documented somewhere in Intel's optimization manual, but it doesn't seem to be.
If this is correct, you get basically the same benefit: a hit in the STLB some time later, after the entry has been evicted from the dTLB, but without wasting space on duplicate entries.
So for example, if you keep code and data in the same page, you could get an iTLB miss when executing the code, and then a dTLB miss that also misses in the STLB and does another page walk if that code loads data from the same page. (That's one reason we don't keep read-only data in the same page as code on x86: it has no code-size advantage, and it wastes iTLB + dTLB coverage by keeping the same page in both TLBs.)
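As a concrete illustration of that double footprint (a hypothetical sketch; casting a function pointer to a data pointer isn't standard C, but it works on mainstream x86 toolchains, and code pages are normally mapped read+execute):

```c
#include <stdint.h>
#include <stdio.h>

static int add_one(int x) { return x + 1; }

int main(void)
{
    /* Executing add_one needs an iTLB entry for its page. */
    int r = add_one(41);

    /* Loading from the same page (here, add_one's own machine code)
     * needs a dTLB entry too, so the page takes a slot in both TLBs. */
    const uint8_t *p = (const uint8_t *)add_one;
    printf("r=%d, first opcode byte=0x%02x\n", r, *p);
    return 0;
}
```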
But perhaps I'm wrong; Travis (@BeeOnRope) suggested using data prefetch to reduce iTLB miss cost; he's assuming that the page walker fills an entry in STLB and dTLB. (On Core 2(?) and later, TLB-miss software-prefetch can trigger a walk instead of giving up.)
I think L2 prefetching is likely to be very effective for code that would otherwise miss to DRAM. Yes, you don't warm the ITLB or the L1I, but you warm the L2 and STLB, so you are taking something like a dozen cycles for the first execution.
This would work for a NINE (not-inclusive, not-exclusive) STLB; it doesn't have to actually be inclusive, just not exclusive and not a victim cache. (e.g. L2 cache is NINE wrt. L1i and L1d: they fetch through it, but lines can be evicted from L2 without forcing eviction from either L1 cache.)
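A minimal sketch of that trick (cold_function is hypothetical, and the cast relies on the same non-standard function-pointer-to-data-pointer conversion as above):

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T1 */

void cold_function(void);   /* hypothetical: cold code we expect to miss on */

void warm_then_call(void)
{
    /* prefetcht1 pulls the line into L2 (not L1i/L1d).  If the address
     * also misses in the TLB, CPUs since roughly Core 2 let the
     * prefetch trigger a page walk, which should leave an entry in the
     * STLB (and dTLB). */
    _mm_prefetch((const char *)cold_function, _MM_HINT_T1);

    /* ... overlap other work while the line / translation arrive ... */

    cold_function();   /* first execution can now hit in L2 + STLB
                        * instead of missing all the way to DRAM */
}
```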
Further details with links to source:
https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Memory_Hierarchy
https://www.7-cpu.com/cpu/Skylake.html has timing results and TLB sizes, but not the info we're looking for.
Core 2 was different: https://www.realworldtech.com/nehalem/8/ says it had a tiny 16-entry L1dTLB used only for loads, with the L2 DTLB serving stores as well as loads that missed the L1dTLB.
Nehalem changed that (64-entry DTLB), along with reorganizing the memory hierarchy to what's still used on client (non-server) chips: a large shared inclusive LLC and 256k private L2 caches. (And of course still the usual split 32k L1i/d.) See also: Which cache mapping technique is used in intel core i7 processor?
Regarding prefetcht1: there is a topic on the Intel official forum describing exactly that. Here is what Travis D. wrote: "I think L2 prefetching is likely to be very effective for code that would otherwise miss to DRAM. Yes, you don't warm the ITLB or the L1I, but you warm the L2 and STLB, so you are taking something like a dozen cycles for the first execution." – Cerous
The TLB hierarchy doesn't have to be inclusive for prefetcht1 to work; NINE would be fine, too. (See updated answer.) – Delicacy
Searching for "victim" and "inclusive" did not give any results related to the TLB. How did you discover the TLB behavior? Was it some personal research? Actually, I discovered a new thing that is not really related to the topic: the LLC is non-inclusive since Skylake, and a victim cache for the mid-level cache. – Cerous