How does the CPU make a data request via TLBs and caches?

Asked 22/3, 2014 at 1:34 Answered 23/3, 2014 at 2:48

Tags: caching, cpu, intel, cpu-architecture, tlb

I am studying the last few Intel microarchitectures (Nehalem/SB/IB and Haswell). I am trying to work out what happens, at a fairly simplified level, when a data request is made. So far I have this rough idea:

Execution engine makes data request
"Memory control" queries the L1 DTLB
If the above misses, the L2 TLB is now queried

At this point two things can happen, a miss or a hit:

If it's a hit, the CPU tries the L1D/L2/L3 caches, the page table, and then main memory/hard disk, in that order?
If it's a miss, the CPU asks the (integrated memory controller?) to check the page table held in RAM (did I get the role of the IMC correct there?).

If somebody could edit/provide a set of bullet points giving a basic "overview" of what the CPU does after the execution engine makes a data request, covering the

L1 DTLB (data TLB)
L2 TLB (data + instruction TLB)
L1D Cache (data cache)
L2 cache (data + instruction cache)
L3 cache (data + instruction cache)
The part of the CPU which controls access to main memory
Page table

it would be most appreciated. I did find some useful images, but they didn't really separate the interaction between the TLBs and the caches.

UPDATE: I have changed the above, as I think I now understand. The TLB just gets the physical address from the virtual one. If there's a miss, we're in trouble and need to check the page table. If there's a hit, we just proceed down through the memory hierarchy, starting with the L1D cache.
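A minimal sketch of that idea in Python, assuming 4 KiB pages and modelling the TLB as a hypothetical dict from virtual page number to physical frame number (real TLBs are small set-associative hardware structures, not dicts). Only the page number is translated; the low 12 offset bits pass through unchanged:

```python
PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def tlb_translate(tlb, vaddr):
    """Return the physical address for vaddr, or None on a TLB miss.

    tlb is a hypothetical dict: virtual page number -> physical frame number.
    """
    vpn = vaddr >> PAGE_SHIFT        # virtual page number: the part the TLB looks up
    offset = vaddr & PAGE_MASK       # offset within the page: not translated
    pfn = tlb.get(vpn)
    if pfn is None:
        return None                  # miss: the page table has to be walked
    return (pfn << PAGE_SHIFT) | offset


tlb = {0x12345: 0xABC}
print(hex(tlb_translate(tlb, 0x12345678)))  # 0xabc678
print(tlb_translate(tlb, 0x99999000))       # None
```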

Jarboe asked 22/3, 2014 at 1:34

See also a question (from the same user) about whether the page-walk loads to resolve a TLB miss come from cache or not. I dug up some interesting stuff. – Volant 29/5, 2016 at 19:51

The pagemap (page table) is used only for virtual-to-physical address translation. However, because it resides in memory and is only partially cached in the TLBs, you may have to access it in memory during the translation process.

The basic flow is as follows:

1. Execution calculates the address (some of the calculation, such as scaling and offsets, may actually be done in the memory unit).
2. Look up the DTLB.
   2.a. If it misses, look up the second-level TLB.
      2.a.a. If that misses too, start a page walk.
      2.a.b. If it hits in the second-level TLB, fill the entry into the DTLB and proceed with the new physical address.
   2.b. If it hits in the DTLB, proceed with the physical address.
3. Look up the L1; if it misses, look up the L2; if it misses again, look up the L3; if that misses too, send the request to the memory controller and wait for the DRAM access.
4. When the data returns (from whichever level), fill it into the caches along the way (depending on the fill policy, cache inclusiveness, the instruction's temporality hints, the memory region type, and probably other factors as well).
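The basic flow above can be sketched as a toy Python model. Everything here is an assumption for illustration: the TLBs, page table, caches, and DRAM are plain dicts, the page table is flattened to a single lookup, caches are keyed by exact physical address rather than by 64-byte line, and every level that missed is filled on the way back. Real cores also overlap steps, for example starting the L1 set lookup while the DTLB is still being searched.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def access(vaddr, dtlb, stlb, page_table, caches, dram, log):
    """Toy load: translate vaddr, then search the caches, then DRAM.

    dtlb, stlb, page_table: dicts mapping virtual page number -> physical frame number.
    caches: list of (name, dict) pairs ordered from L1 outward; each dict maps
            physical address -> value (a real cache works on whole lines).
    dram: dict mapping physical address -> value.
    log: list recording which structure satisfied each step.
    """
    vpn, offset = vaddr >> PAGE_SHIFT, vaddr & OFFSET_MASK

    # Translation: DTLB, then the second-level TLB, then a page walk.
    if vpn in dtlb:
        log.append("DTLB hit")
        pfn = dtlb[vpn]
    elif vpn in stlb:
        log.append("STLB hit")
        pfn = stlb[vpn]
        dtlb[vpn] = pfn                      # fill into the DTLB
    else:
        log.append("page walk")
        pfn = page_table[vpn]                # flattened; really a multi-level tree walk
        stlb[vpn] = pfn
        dtlb[vpn] = pfn
    paddr = (pfn << PAGE_SHIFT) | offset

    # Data lookup: L1, L2, L3 in order; if all miss, go to DRAM.
    missed = []
    for name, cache in caches:
        if paddr in cache:
            log.append(name + " hit")
            value = cache[paddr]
            break
        missed.append(cache)
    else:
        log.append("DRAM")
        value = dram[paddr]

    # Fill the levels that missed (simplified fill-everywhere policy).
    for cache in missed:
        cache[paddr] = value
    return value


dtlb, stlb, page_table = {}, {}, {0x12345: 0xABC}
caches = [("L1", {}), ("L2", {}), ("L3", {})]
dram = {0xABC678: 42}

log = []
print(access(0x12345678, dtlb, stlb, page_table, caches, dram, log), log)
# 42 ['page walk', 'DRAM']
log = []
print(access(0x12345678, dtlb, stlb, page_table, caches, dram, log), log)
# 42 ['DTLB hit', 'L1 hit']
```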

If a page walk is required, stall the main request and issue physical loads to the pagemap (according to the architectural definition). On x86 these may involve CR3, PDPTR, PDP, PDE, PTE, etc., depending on the paging mode, page sizes, and so on. Note that under virtualization, each page-walk level in the VM may require a full page walk on the host, so the number of steps roughly squares: with 4-level guest and host tables, a nested walk can take up to 24 memory references instead of 4.
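A back-of-the-envelope count of the worst case, in Python, assuming every level misses in all TLBs and paging-structure caches: with an n-level guest table and an m-level host (nested) table, each guest-physical address that the walk touches needs its own host walk, which gives (n+1)(m+1) - 1 references.

```python
def native_walk_refs(levels):
    # Without virtualization: one dependent load per level of the tree.
    return levels


def nested_walk_refs(guest_levels, host_levels):
    # Worst case, nothing cached. Each guest-physical address the walk touches
    # (the guest root from the guest's CR3, each next-level table pointer, and
    # the final data address) needs a full host walk: guest_levels + 1 of them.
    host_loads = (guest_levels + 1) * host_levels
    # Plus one load per guest page-table entry actually read.
    guest_loads = guest_levels
    return host_loads + guest_loads


print(native_walk_refs(4))     # 4
print(nested_walk_refs(4, 4))  # 24, i.e. (4 + 1) * (4 + 1) - 1
```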

Note that a pagemap is basically a tree structure, where each access depends on the value read by the previous one (and on part of the virtual address being translated). These accesses are therefore dependent, and only once the last one completes do you get the physical address and can go back to #3. All along, the line you want may be sitting in your L1 without you being able to know it (although, to be honest, if you needed a page walk you're not likely to still have the line in your upper caches).
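A toy version of that dependent tree walk in Python, assuming x86-64 4-level paging with 4 KiB pages (9 index bits per level, taken from bits 47:39, 38:30, 29:21 and 20:12). The `memory` dict and all addresses are hypothetical, and real entries also carry flag bits that this sketch ignores:

```python
LEVELS = 4          # assumption: x86-64 4-level paging
INDEX_BITS = 9      # 512 entries per table
PAGE_SHIFT = 12     # 4 KiB pages
INDEX_MASK = (1 << INDEX_BITS) - 1


def table_indices(vaddr):
    """Per-level indices, root first: bits 47:39, 38:30, 29:21, 20:12."""
    return [
        (vaddr >> (PAGE_SHIFT + INDEX_BITS * (LEVELS - 1 - level))) & INDEX_MASK
        for level in range(LEVELS)
    ]


def page_walk(memory, root, vaddr):
    """memory: hypothetical dict (table_base, index) -> next table base; the last
    level yields the physical frame base. root plays the role of CR3."""
    base = root
    for index in table_indices(vaddr):
        # This load cannot start until the previous one has returned `base`.
        base = memory[(base, index)]
    return base | (vaddr & ((1 << PAGE_SHIFT) - 1))


vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5A5
memory = {
    (0x1000, 1): 0x2000,
    (0x2000, 2): 0x3000,
    (0x3000, 3): 0x4000,
    (0x4000, 4): 0xABC000,
}
print(table_indices(vaddr))                   # [1, 2, 3, 4]
print(hex(page_walk(memory, 0x1000, vaddr)))  # 0xabc5a5
```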

Other important notes: the pagemap lives in physical address space and is accessed that way. You don't want to have to translate the accesses you need for translation; that could deadlock :)
More importantly, the pagemap data can be cached, so while a simple memory access may expand to multiple ones due to a TLB miss, the pagewalk may still be fairly cheap.
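To see why caching the pagemap matters, here is a rough cost model in Python. The latencies are made-up placeholders, not measurements of Nehalem, Sandy Bridge, or any other CPU; the only point is that the dependent walk loads add up serially, so where each one hits dominates the cost:

```python
# Illustrative latencies in cycles: placeholders, NOT measurements of any CPU.
LATENCY = {"L1": 4, "L2": 12, "L3": 40, "DRAM": 200}


def walk_cost(levels_hit_in):
    """Total latency of a serially dependent page walk, given where each
    level's entry was found. Dependent loads cannot overlap, so costs add."""
    return sum(LATENCY[where] for where in levels_hit_in)


cached = walk_cost(["L1", "L1", "L2", "L2"])
uncached = walk_cost(["DRAM"] * 4)
print(cached, uncached)  # 32 800
```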

Marable answered 22/3, 2014 at 23:34

Great answer! I did check the Intel architecture manual 1a, but it didn't have a diagram of this. Could you recommend any resources? One thing: you have "STLB" for 2.a.b. For a moment I thought the "S" stood for "static", and then I looked at my keyboard and it's next to the letter "D". – Jarboe 23/3, 2014 at 1:36

@user997112, sorry, typo. Actually STLB is used here and there (S meaning second-level), but it's redundant there and I thought it would be clearer without it. As for diagrams, I don't know of any good ones aside from these, but for a more in-depth overview you should read the Software Developer's Manuals. – Marable 23/3, 2014 at 6:16

Yes. The long description at http://lwn.net/Articles/252125/ explains this and illustrates pictorially the path from the CPU to L1 to L2 to L3.

Expulsive answered 23/3, 2014 at 2:48
