Is physical or virtual addressing used in x86/x86_64 processors for the L1, L2 and L3 caches?
Which addressing is used in x86/x86_64 processors for caching in the L1, L2 and L3 (LLC) - physical or virtual (using PT/PTE and the TLB) - and does PAT (page attribute table) somehow affect it?

And is there a difference between drivers (kernel-space) and applications (user-space) in this case?


Short answer - Intel uses virtually indexed, physically tagged (VIPT) L1 caches:

  • L1 - Virtual addressing (in the 8-way cache, selecting the set requires only the low 12 bits, which are the same in virtual and physical addresses)
  • L2 - Physical addressing (requires a TLB lookup for the virtual-to-physical translation)
  • L3 - Physical addressing (requires a TLB lookup for the virtual-to-physical translation)
Toscana answered 26/9, 2013 at 21:46
You cannot address the cache. You can only address memory. The cache is handled by the CPU privately. – Rarefy
@Kerrek SB Yes, I know, but does the CPU cache use the TLB, with all the overheads of virtual addressing, or not? – Toscana
L1 is still physically tagged, and as you say the indexing gets the speed of virtual but also the lack of aliasing of physical. So it's really L1 - physical; it behaves exactly like PIPT but with a couple of cycles lower latency. Only the uop cache is virtually addressed in Intel CPUs. Please don't edit answers into your question. – Mcneal

The answer to your question is: it depends. It's strictly a CPU design decision that balances the tradeoff between performance and complexity.

Take recent Intel Core processors, for example - they're physically tagged and virtually indexed (at least according to http://www.realworldtech.com/sandy-bridge/7/). This means that the caches can complete lookups only in pure physical address space in order to determine whether the line is there or not. However, since the L1 is 32k and 8-way associative, it uses 64 sets, so you need only address bits 6 to 11 to find the correct set. As it happens, virtual and physical addresses are the same in this range, so you can look up the DTLB in parallel with reading a cache set - a known trick (see http://en.wikipedia.org/wiki/CPU_cache for a good explanation).
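
As a rough sketch of the arithmetic described above (the constants are the 32k, 8-way, 64-byte-line geometry quoted here; nothing below queries real hardware):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* 32 KiB / (8 ways * 64-byte lines) = 64 sets, so the set index
     * is address bits 6..11. */
    #define LINE_BITS 6   /* log2(64-byte line) */
    #define SET_BITS  6   /* log2(64 sets)      */
    #define PAGE_BITS 12  /* log2(4 KiB page)   */

    static unsigned l1_set(uint64_t addr)
    {
        return (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
    }

    int main(void)
    {
        /* The top index bit is bit 11, still inside the page offset, so
         * virtual and physical addresses agree on the set index and the
         * set can be read in parallel with the DTLB lookup. */
        assert(LINE_BITS + SET_BITS <= PAGE_BITS);

        uint64_t va = 0x7ffd12345678;  /* arbitrary example address */
        printf("set index of %#llx = %u\n",
               (unsigned long long)va, l1_set(va));
        return 0;
    }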

In theory, one could build a virtually indexed + virtually tagged cache, which would remove the need to go through address translation (TLB lookups, and also page walks on TLB misses). However, that would cause numerous problems, especially with memory aliasing - a case where two virtual addresses map to the same physical one.

Say core1 has virtual addr A cached in such a fully virtual cache (it maps to phys addr C, but we haven't done that check yet). core2 writes to virtual addr B that maps to the same phys addr C - this means we need some mechanism (usually a "snoop", a term coined by Jim Goodman) that goes and invalidates that line in core1, handling the data merge and coherency management if needed. However, core1 can't answer that snoop, since it doesn't know about virtual addr B and doesn't store physical addr C in the virtual cache. So you can see we have an issue, although this is mostly relevant for strict x86 systems; other architectures may be more lax and allow simpler management of such caches.
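
To make the aliasing case concrete, here is a minimal POSIX sketch (Linux; the shared-memory name and the omitted error handling are illustrative only) that maps one physical page at two different virtual addresses - exactly the situation a purely virtual cache would have to detect by some other means:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)  /* may need -lrt on older glibc */
    {
        /* One shared-memory page, mapped twice: two distinct virtual
         * addresses aliasing the same physical page. */
        int fd = shm_open("/alias-demo", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);

        char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        strcpy(a, "written via mapping A");
        printf("a=%p b=%p, read via b: \"%s\"\n", (void *)a, (void *)b, b);

        shm_unlink("/alias-demo");
        return 0;
    }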

Regarding the other questions - there's no real connection with PAT that I can think of; the cache is already designed and can't change for different memory types. Same answer for the other question - the HW mostly sits below the user/kernel-mode distinction (except for the mechanisms it provides for security checking, mostly the various rings).

Evette answered 27/9, 2013 at 20:58
Big thanks! And in your opinion, is there any benefit in knowing this mechanism on x86 - can I, as a developer who knows this, somehow optimize the performance of my program? – Toscana
Absolutely - a SW developer who doesn't know the HW he runs on would do a poor job optimizing it (if he needs to), or debugging it (when he needs to :). The cache's address-mapping type is a little low-level indeed, although it does open a hatch to some important optimizations such as SW prefetch intrinsics and cache-aware design. See this great post for examples - #16699747. There's also the question of out-of-order execution that might give some hints, and of course the variety of compiler optimizations (not HW, but important too). – Evette
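
For a flavor of the software-prefetch intrinsics mentioned in the comment above (the 8-cache-line prefetch distance is an arbitrary assumed tuning value, not a recommendation):

    #include <stddef.h>
    #include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 (SSE) */

    /* Stream through an array while prefetching a fixed distance ahead. */
    static float sum_with_prefetch(const float *a, size_t n)
    {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) {
            if (i + 128 < n)  /* 128 floats = 8 cache lines ahead */
                _mm_prefetch((const char *)&a[i + 128], _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }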
I mean - benefit from the knowledge that on x86 "they are physically tagged and virtually indexed". – Toscana
It's not x86-specific, it's just a common design point that occurs on many CPUs. I'm pretty sure most ARM-based designs also utilize it. To benefit from it, you need to make sure your addresses don't align too much physically on the tag bits (or at least have a good spread) - that's no easy task, as you usually don't decide where the OS assigns your pages. – Evette
Thanks! But if I can't affect "where the OS assigns your pages", what benefit can I take from this? – Toscana
Not much, probably nothing. Caches are designed to achieve the best spread of addresses in order to minimize cache thrashing. It would have a far greater benefit to design your code to be cache-friendly in general, by tiling large structures to fit in the cache or avoiding false sharing, than to worry about physical pages being scattered. I would pay attention to how the lower bits match (e.g. when working with A and B arrays, try to have them at different page offsets), but that applies to virtual addresses and isn't specifically related to VIPT caches. – Evette
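
A small sketch of the "different page offsets" advice from the comment above (the two-cache-line shift and the allocation sizes are arbitrary illustrative values):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = 1u << 20;  /* 1M floats per array */

        /* Page-align both allocations, then shift B by two 64-byte cache
         * lines so A[i] and B[i] land in different sets of a cache that
         * is indexed by the low address bits. */
        char *rawA = aligned_alloc(4096, n * sizeof(float));
        char *rawB = aligned_alloc(4096, n * sizeof(float) + 4096);
        float *A = (float *)rawA;
        float *B = (float *)(rawB + 2 * 64);

        printf("A page offset: %zu, B page offset: %zu\n",
               (size_t)((uintptr_t)A & 4095), (size_t)((uintptr_t)B & 4095));

        free(rawA);
        free(rawB);
        return 0;
    }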
Re: the VIPT L1 speed hack that allows fetching tags (and data) from a set in parallel with the TLB access - it's really more like a PIPT cache, with the index translation happening for free (because the index bits are all below the page offset). I took a stab at writing a detailed explanation of why it works a while ago; you might want to link that as well as Wikipedia. – Mcneal
