What makes a TLB faster than a Page Table if they both require two memory accesses?

Just going off Wikipedia:

The page table, generally stored in main memory, keeps track of where the virtual pages are stored in the physical memory. This method uses two memory accesses (one for the page table entry, one for the byte) to access a byte. First, the page table is looked up for the frame number. Second, the frame number with the page offset gives the actual address. Thus any straightforward virtual memory scheme would have the effect of doubling the memory access time. Hence, the TLB is used to reduce the time taken to access the memory locations in the page table method.
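
As a rough illustration, here is a minimal C sketch of the single-level scheme the quote describes (the names and sizes are hypothetical, assuming 4 KiB pages): reading page_table[vpn] is the first memory access, and dereferencing the physical address it returns is the second:

    #include <stdint.h>

    #define PAGE_SHIFT 12                      /* assume 4 KiB pages */
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

    /* Hypothetical single-level page table: one entry per virtual page,
       each entry holding a physical frame number. */
    extern uint32_t page_table[];

    uint32_t translate(uint32_t vaddr)
    {
        uint32_t vpn    = vaddr >> PAGE_SHIFT;  /* virtual page number     */
        uint32_t offset = vaddr & PAGE_MASK;    /* offset within the page  */
        uint32_t frame  = page_table[vpn];      /* memory access #1        */
        return (frame << PAGE_SHIFT) | offset;  /* access #2 uses this     */
    }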

So given that, what I'm curious about is why the TLB is actually faster because from what I know it's just a smaller, exact copy of the page table.

You still need to access the TLB to find the physical address, and then once you have that, you still need to actually access the data at the physical address, which is two lookups just like with the page table.

I can only think of two reasons why the TLB is faster:

  • looking up an address in the TLB or page table is not O(n) (I assumed it's O(1), like a hash table). Thus, since the TLB is much smaller, it's faster to do a lookup. Also, in this case, why not just use a hash table instead of a TLB?

  • I incorrectly interpreted how the TLB works, and it's not actually doing two accesses.

Saltus answered 15/12, 2016 at 20:53 Comment(2)
Feel free to ask if you have any queries. - Jackqueline
The supposition that reading the page table requires one memory access is incorrect due to multilevel paging in modern CPUs (see my answer). - Volost

I realize it has been three years since this question was asked, but since it is still just as relevant, and it still shows up in search engines, I'll try my best to produce a complete answer.

Accessing the main memory through the TLB rather than the page table is faster primarily for two reasons:

1. The TLB is faster than main memory (which is where the page table resides).

Typical access times are on the order of <1 ns for the TLB and ~100 ns for main memory. A TLB access is part of an L1 cache hit, and modern CPUs can do two loads per clock if both hit in the L1d cache.

The reasons for this are twofold:

  1. The TLB is located within the CPU, while main memory - and thus the page table - is not.
  2. The TLB - like other caches - is made of fast and expensive SRAM, whereas main memory usually consists of slow and inexpensive DRAM (read more here).

Thus, if the supposition that both the TLB and the page table require only one memory access were correct, a TLB hit would still, roughly speaking, halve memory access time. However, as we shall see next, the supposition is not correct, and the benefit of having a TLB is even greater.

2. Accessing the page table usually requires multiple memory accesses.

This really is the crux of the issue.

Modern CPUs tend to use multilevel page tables in order to save memory. Most notably, x86-64 page tables currently consist of up to four levels (and a fifth may be coming). This means that accessing a single byte in memory through the page table requires up to five memory accesses: four for the page table and one for the data. Obviously the cost would be unbearably high if not for the TLB; it is easy to see why CPU and OS engineers put in a lot of effort to minimize the frequency of TLB misses.
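
To make the walk concrete, here is a minimal C sketch of an x86-64-style 4-level lookup (the names and layout are simplified assumptions: real page-table entries also carry present/permission bits, and huge pages can end the walk early). Each indexing step is one memory access:

    #include <stdint.h>

    typedef uint64_t pte_t;

    /* 9 index bits per level, 12-bit page offset (4 KiB pages). */
    #define IDX(va, level) (((va) >> (12 + 9 * (level))) & 0x1FF)
    #define FRAME(e)       ((e) & ~0xFFFULL)   /* strip the low flag bits */

    uint64_t walk(pte_t *pml4, uint64_t vaddr)
    {
        pte_t *pdpt = (pte_t *)FRAME(pml4[IDX(vaddr, 3)]);  /* access 1 */
        pte_t *pd   = (pte_t *)FRAME(pdpt[IDX(vaddr, 2)]);  /* access 2 */
        pte_t *pt   = (pte_t *)FRAME(pd[IDX(vaddr, 1)]);    /* access 3 */
        uint64_t pa = FRAME(pt[IDX(vaddr, 0)]);             /* access 4 */
        return pa | (vaddr & 0xFFF);  /* access 5 reads the data itself */
    }

Note that every step needs the pointer loaded by the step before it, so the four page-table accesses form a serial chain and cannot be overlapped.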

Finally, do note that even this explanation is somewhat of a simplification, as it ignores, among other things, data caching. The detailed mechanics of modern desktop CPUs are complex and, to a degree, undisclosed. For a more detailed discussion on the topic, refer to this thread, for instance.

Page-table accesses can be and are cached by the data caches on modern CPUs, but the next access in a page walk depends on the result of the previous one (a pointer to the next level of the page table), so a 4-level page walk has roughly 4 × 4 cycles = 16 cycles of latency even if every access hits in the L1d cache. That is far more latency for the pipeline to hide than the ~3 to 4 cycle TLB latency that is part of an L1d-cache-hit load on a modern Intel CPU (which of course uses TLBs for both data and instruction accesses).

Volost answered 4/1, 2020 at 12:55 Comment(1)
Indeed, the TLB is a fast CAM that's typically accessed in parallel with L1 cache (while fetching tags/data from a set). See VIPT Cache: Connection between TLB & Cache?. That's how modern microarchitectures can run two loads per clock cycle when they hit in the dTLB and L1d cache. - Sacker

You are right in your assumption that the approach with a TLB still requires two accesses. But the approach with a TLB is faster because:

The TLB is made of a faster kind of memory called associative memory.

Usually we make two memory accesses to physical memory, but with a TLB there is one access to the TLB and the other access is to physical memory.

Associative memory is faster because it is content-addressable memory, but it is expensive too, because of the extra logic circuits required (see the sketch below).

You can read about the content addressable memory here.
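
For intuition, here is a minimal software model of a fully associative TLB lookup (the structure is hypothetical; in real hardware, every entry is compared against the virtual page number in parallel in a single cycle, so the loop below models only the logic, not the speed):

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64

    struct tlb_entry { uint64_t vpn; uint64_t frame; bool valid; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    bool tlb_lookup(uint64_t vpn, uint64_t *frame_out)
    {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *frame_out = tlb[i].frame;   /* hit: no page-table access */
                return true;
            }
        }
        return false;                        /* miss: walk the page table */
    }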

Jackqueline answered 16/12, 2016 at 6:12 Comment(2)
This answer is misleading as well as partially incorrect, and it answers a whole different question ("Why is TLB access time constant?"). It's not content-addressability that makes the TLB faster than main memory - all that means is that the access time is constant, which holds true for main memory as well. Also, the assumption that two memory accesses are required in both cases is not correct due to multilevel paging (see my answer). - Volost
I wouldn't say "usually". Have there been any real-world CPU designs with paged virtual memory without at least a tiny TLB, like at least 1 entry? I don't think that's what you meant, but better wording might be "without a TLB, we'd have to walk the page table (accessing via L1d cache) for every load, store, and code-fetch, taking 2 to 4 extra accesses per architectural memory access". - Sacker

It depends upon the specific implementation. In general, the TLB is a cache that exists within the CPU.

You still need to access the TLB to find the physical address, and then once you have that, you still need to actually access the data at the physical address, which is two lookups just like with the page table.

The CPU can access the cache much faster than it can access data through the memory bus. It is making two accesses to two different places (one faster and one slower). Also, it is possible for the memory location to be cached within the CPU as well, in which case no accesses are required to go through the memory bus.

Extravagate answered 17/12, 2016 at 4:32 Comment(1)
+1 for including a mention of caching. Other than that, the answer is incomplete and missing the fact that modern CPUs use multilevel paging (see my answer), meaning that reading the page table requires multiple memory accesses, not just one. - Volost

I think @ihohen pretty much said it, but as a student writing for future students who may come here, in simple words the explanation is:

Without a TLB, in single-level paging you need two accesses to main memory: one to find the translation of the logical address in the page table (which is placed in main memory), and another to actually access the memory block.

Now with a TLB, you reduce the above to only one access (the second one), because the step of finding the translation will (hopefully) take place without accessing main memory: you find the translation in the TLB, which is placed in the CPU.

So when we say that a TLB reduces access time by a factor of 2, we mean, approximately, that if we ignore the case of a TLB miss and consider the simplest model of paging (the single-level one), then it is fair to say that a TLB speeds up the process by 2.

There will be many variations in practice, first and foremost because today's computers use advanced paging techniques (multilevel paging, demand paging, etc.), but this sentence is an intuitive explanation of why the idea of a TLB is much more helpful than a simple page table.

The book "Operating Systems" by Silberschatz gives another (a little more detailed) formula to measure the effective access time with a TLB. Consider:
h : TLB hit ratio
τ : time to access main memory
e : time spent searching the TLB for an entry
t = h · (e + τ) + (1 − h) · (e + 2τ)
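
Plugging in some illustrative numbers (the values here are assumptions for the sake of the example, roughly in line with the access times quoted in the answer above): with h = 0.98, τ = 100 ns, and e = 1 ns,

t = 0.98 · (1 + 100) + 0.02 · (1 + 200) = 98.98 + 4.02 ≈ 103 ns

versus 2τ = 200 ns with no TLB at all, which is close to the factor-of-2 speedup described above.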

Henrieta answered 4/7, 2021 at 10:46 Comment(0)
