How is an LRU cache implemented in a CPU?

Asked 3/5, 2014 at 18:50 Answered 14/9, 2014 at 17:2

Solved caching cpu cpu-architecture cpu-cache lru

I'm studying up for an interview and want to refresh my memory on caching. If a CPU has a cache with an LRU replacement policy, how is that actually implemented on the chip? Would each cache line store a timestamp tick?

Also, what happens in a dual-core system where both CPUs write to the same address simultaneously?

Fontaine answered 3/5, 2014 at 18:50 Comment(6)

You may want to take a look here. – Candlelight 3/5, 2014 at 18:56

Thank you that is useful. I also added some more to my qn. – Fontaine 3/5, 2014 at 19:25

The timestamp is not a tick but some short value (remember that LRU works independently for each cache set). A dual-core system has several cache levels, and some levels are private to a core (only the owning core may ask that cache to store a new line). – Translative 3/5, 2014 at 19:43

What would the short value be set to in practice? – Fontaine 3/5, 2014 at 20:46

Possibly of interest: What cache invalidation algorithms are used in actual CPU caches? – Finitude 3/5, 2014 at 22:40

The last line is unrelated to LRU and deserves a question of its own. In a nutshell - you can't write without ownership of that line, which has to be visible globally. – Endowment 3/5, 2014 at 22:51

For a traditional cache with only two ways, a single bit per set can be used to track LRU. On any access to a set that hits, the bit can be set to the way that did not hit.
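As a rough illustration of the one-bit scheme, here is a minimal software model of a single two-way set. The class and method names are made up for this sketch; real hardware would just be one flip-flop per set.

```python
class TwoWayLruSet:
    """Software model of one two-way cache set with a single LRU bit.

    Hypothetical names; this only mirrors the bookkeeping, not real hardware.
    """

    def __init__(self):
        self.lru_way = 0  # the way that would be evicted next

    def touch(self, way):
        # On a hit (or a fill) in `way`, the other way becomes the LRU one.
        self.lru_way = 1 - way

    def victim(self):
        return self.lru_way
```

After `touch(0)` the victim is way 1, and after `touch(1)` it is way 0 again.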

For larger associativity, the number of states increases dramatically: factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value.
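The state counts above can be checked with a few lines of Python. The function name is just for illustration; it counts the W! possible orderings and the minimum number of bits needed to tell them apart.

```python
import math


def true_lru_bits(ways):
    """Return (number of LRU orderings, minimum bits per set) for `ways` ways."""
    states = math.factorial(ways)
    # Smallest b with 2**b >= states, computed without floating point.
    bits = (states - 1).bit_length()
    return states, bits


for w in (2, 4, 8):
    print(w, true_lru_bits(w))
```

This is only the information-theoretic minimum; a practical encoding may spend more bits to make the update logic simpler.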

For a 4-way cache, the following encoding of the state would seem to work reasonably well: two bits for the most recently used (MRU) way number, two bits for the next most recently used way number, and one bit indicating whether the higher or lower numbered of the two remaining ways was more recently used.

On a MRU hit, the state is unchanged.
On a next-MRU hit the two bit fields are swapped.
On other hits, the numbers of the two other ways are decoded, the number of the way that hits is placed in the first two-bit portion and the former MRU way number is placed in the second two-bit portion. The final bit is set based on whether the former next-MRU way number is higher or lower than the number of the remaining way that did not hit (the former next-MRU way is the more recently used of those two).
On a miss, the state is updated as if an LRU hit had occurred.
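The four update rules above can be sketched in software as follows. This is only a model under my own assumptions: the state is a tuple `(mru, nxt, hi_newer)` standing in for the 2+2+1 bits, the polarity of the last bit (1 means the higher-numbered remaining way is the more recent one) is an arbitrary choice, and the function names are invented for the example.

```python
def decode(state):
    """Return the four ways ordered from most to least recently used."""
    mru, nxt, hi_newer = state
    lo, hi = sorted(set(range(4)) - {mru, nxt})
    return [mru, nxt, hi, lo] if hi_newer else [mru, nxt, lo, hi]


def on_hit(state, way):
    """Return the new state after a hit in `way`."""
    mru, nxt, hi_newer = state
    if way == mru:
        return state                    # MRU hit: unchanged
    if way == nxt:
        return (nxt, mru, hi_newer)     # next-MRU hit: swap the two fields
    # Hit in one of the two remaining ways: find the one that did not hit.
    (other,) = set(range(4)) - {mru, nxt, way}
    # The former next-MRU way is now more recent than `other`.
    return (way, mru, int(nxt > other))


def on_miss(state):
    """Pick the LRU way as victim and update as if it had been hit."""
    victim = decode(state)[3]
    return victim, on_hit(state, victim)
```

For example, starting from `(0, 1, 0)` (order 0, 1, 2, 3 from MRU to LRU), a hit in way 3 gives `(3, 0, 0)`, which decodes to the order 3, 0, 1, 2.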

Because true LRU tracking has such overhead, simpler mechanisms like binary tree pseudo-LRU (pLRU) are often used. On a hit, such a scheme just updates each node of the tree along the path to the hit way, recording which half of the associated ways the hit was in. For a power-of-two number of ways W, a binary tree pLRU cache needs W-1 bits of state per set. For example, a hit in way 6 of an 8-way cache (using a 3-level binary tree) would clear the bit at the base of the tree to indicate that the lower half of the ways (0,1,2,3) is less recently used, clear the upper bit at the next level to indicate that the lower half of those ways (4,5) is less recently used, and set the upper bit in the final level to indicate that the upper half of those ways (7) is less recently used. Not having to read this state in order to update it can simplify hardware.
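Here is a small sketch of tree pLRU following the bit convention of the way-6 example (0 means the lower half is less recently used, 1 means the upper half is). Storing the tree in heap order, and the class and method names, are assumptions made for the illustration.

```python
class TreePlru:
    """Binary tree pseudo-LRU for one set; W-1 bits stored in heap order.

    Node n has its lower-half child at 2n+1 and its upper-half child at 2n+2.
    Bit value 0: lower half is less recently used; 1: upper half is.
    """

    def __init__(self, ways=8):
        assert ways >= 2 and ways & (ways - 1) == 0, "power of two expected"
        self.ways = ways
        self.bits = [0] * (ways - 1)

    def touch(self, way):
        # Write-only update: no need to read the old bits first.
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way >= mid:              # hit in upper half, lower half is older
                self.bits[node] = 0
                node, lo = 2 * node + 2, mid
            else:                       # hit in lower half, upper half is older
                self.bits[node] = 1
                node, hi = 2 * node + 1, mid

    def victim(self):
        # Follow the bits towards the less recently used side.
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node]:
                node, lo = 2 * node + 2, mid
            else:
                node, hi = 2 * node + 1, mid
        return lo
```

Note that the victim is only an approximation of the true LRU way, since the tree does not record the full ordering.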

For skewed associativity, where different ways use different hashing functions, something like an abbreviated time stamp has been proposed (e.g., "Analysis and Replacement for Skew-Associative Caches", Mark Brehob et al., 1997). Using a miss counter is more appropriate than a cycle count, but the basic idea is the same.

With respect to what happens when two cores try to write to the same cache line at the same time, this is handled by only allowing one L1 cache to have the cache line in the exclusive state at a given time. Effectively there is a race and one core will get exclusive access. If only one of the writing cores already has the cache line in a shared state, it will probably be more likely to win the race. With the cache line in shared state, the cache only needs to send an invalidation request to other potential holders of the cache line; with the cache line not present, a write would typically need to request the cache line's data as well as exclusive state.

Writes by different cores to the same cache line (whether to the same specific address or, in the case of false sharing, to another address within the line of data) can result in "cache line ping pong", where different cores invalidate the cache line in other caches to get exclusive access (to perform a write) so that the cache line bounces around the system like a ping pong ball.
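To make the ping-pong effect concrete, here is a deliberately simplified toy model (not a full MESI implementation): each core's copy of one line is either `"M"` (writable) or `"I"` (invalid), and a write by a core that does not hold the line in `"M"` invalidates every other copy first. The function name and the two-state simplification are assumptions for the sketch.

```python
def count_ownership_transfers(writes, cores=2):
    """Count how often one cache line changes owner for a sequence of writes.

    `writes` is a sequence of core numbers, one entry per write to the line.
    """
    state = ["I"] * cores          # per-core state of the single cache line
    transfers = 0
    for core in writes:
        if state[core] != "M":
            # Gain exclusive access: invalidate all other copies first.
            for other in range(cores):
                state[other] = "I"
            state[core] = "M"
            transfers += 1
        # With the line already in "M", the write completes locally.
    return transfers


print(count_ownership_transfers([0, 1] * 4))        # alternating writers
print(count_ownership_transfers([0] * 4 + [1] * 4)) # batched writers
```

In this model, eight alternating writes move the line eight times, while the same eight writes batched per core move it only twice, which is why false sharing between cores can be so costly.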

Finitude answered 3/5, 2014 at 22:39 Comment(0)

There is a good slide deck, Page replacement algorithms, that covers various page replacement schemes. It also explains the LRU implementation using an m×m matrix really well.
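For reference, the matrix method as it is usually described in operating systems textbooks goes roughly like this (I can't confirm it matches the slides exactly): keep an m×m bit matrix, and on a reference to entry k set row k to all ones, then clear column k. The entry whose row holds the fewest ones is the least recently used. The class and method names below are made up for the sketch.

```python
class MatrixLru:
    """m x m bit-matrix LRU tracking (textbook-style sketch, hypothetical names)."""

    def __init__(self, m=4):
        self.m = m
        self.bits = [[0] * m for _ in range(m)]

    def touch(self, k):
        self.bits[k] = [1] * self.m     # set row k to all ones
        for row in self.bits:
            row[k] = 0                  # then clear column k (including row k)

    def victim(self):
        # The LRU entry is the row with the fewest ones (all zeros once
        # every entry has been referenced at least once).
        return min(range(self.m), key=lambda r: sum(self.bits[r]))
```

After touching 0, 1, 2, 3 in that order the victim is entry 0; touching 0 again makes entry 1 the victim.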

Leyte answered 14/9, 2014 at 17:2 Comment(2)

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Otoscope 14/9, 2014 at 17:56

Yup, page is now inaccessible. – Whitcher 1/1, 2023 at 21:36
