How does the indexing of the Ice Lake's 48KiB L1 data cache work?

The Intel optimization manual (September 2019 revision) shows a 48 KiB, 8-way associative L1 data cache for the Ice Lake microarchitecture.

[Screenshot from the manual: Ice Lake's 48 KiB L1 data cache and its 8-way associativity. Footnote 1: Software-visible latency/bandwidth will vary depending on access patterns and other factors.]

This baffled me because:

  • There are 96 sets (48 KiB / 64-byte lines / 8 ways), which is not a power of two.
  • The set-index bits plus the byte-offset bits add up to more than 12, which rules out the cheap VIPT-that-behaves-as-PIPT trick for 4 KiB pages (see the arithmetic sketch after this list).
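
To make the two points concrete, here is a small back-of-the-envelope sketch in C; it uses only the 48 KiB / 64-byte line / 8-way numbers quoted from the manual above:

    /* Back-of-the-envelope check of the two bullet points above. */
    #include <stdio.h>

    int main(void) {
        unsigned cache_bytes = 48 * 1024;  /* size claimed by the manual          */
        unsigned line_bytes  = 64;
        unsigned ways        = 8;          /* associativity claimed by the manual */

        unsigned sets = cache_bytes / (line_bytes * ways);  /* 96: not a power of two */

        unsigned offset_bits = 6;          /* log2(64-byte line)                  */
        unsigned index_bits  = 0;
        while ((1u << index_bits) < sets)  /* ceil(log2(96)) = 7                  */
            index_bits++;

        printf("sets = %u, index+offset bits = %u, page-offset bits = 12\n",
               sets, index_bits + offset_bits);   /* prints 96 and 13 */
        return 0;
    }

With 13 index+offset bits against a 12-bit page offset, at least one set-index bit would have to come from the translated part of the address.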

All in all, it seems that the cache would be more expensive to handle, yet the latency increased only slightly (if it increased at all, depending on what exactly Intel means by that number).

With a bit of creativity I can still imagine a fast way to index 96 sets, but the second point looks like an important breaking change to me.
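
For what it's worth, one purely hypothetical way to index 96 sets quickly (an illustration only, not a claim about real hardware, and the function names below are made up): split the cache into 3 banks of 32 sets, use address bits 10:6 for the set within a bank, and pick the bank with a mod-3 reduction of a few higher bits. Mod 3 is cheap because 4 ≡ 1 (mod 3), so summing base-4 digits preserves the residue.

    /* Hypothetical 96-set index: 3 banks x 32 sets (illustration only). */
    #include <stdint.h>

    static unsigned mod3(unsigned x) {
        /* Repeatedly sum the base-4 digits; since 4 == 1 (mod 3), the digit
           sum has the same residue mod 3 as x. Finally map 3 -> 0. */
        while (x > 3) {
            unsigned s = 0;
            while (x) { s += x & 3u; x >>= 2; }
            x = s;
        }
        return x == 3 ? 0 : x;
    }

    static unsigned set_index_96(uint64_t paddr) {
        unsigned within_bank = (paddr >> 6) & 0x1F;        /* bits 10:6 -> 0..31      */
        unsigned bank        = mod3((paddr >> 11) & 0xFF); /* a few higher bits -> 0..2,
                                                              only roughly uniform     */
        return bank * 32 + within_bank;                    /* 0..95                   */
    }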

What am I missing?

Neediness asked 19/1, 2020 at 12:25

The optimization manual is wrong.

According to the CPUID instruction, the associativity is 12 (on a Core i5-1035G1). See also uops.info/cache.html and en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(client).

This means that there are 64 sets, which is the same as in previous microarchitectures.
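
If you want to reproduce this on your own machine, here is a minimal sketch assuming a GCC/Clang toolchain (it uses <cpuid.h>'s __get_cpuid_count); on an Ice Lake CPU it should print 12 ways and 64 sets:

    /* Query L1D geometry via CPUID leaf 4, sub-leaf 0 (the L1 data cache). */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(4, 0, &eax, &ebx, &ecx, &edx))
            return 1;                                /* leaf 4 not supported */

        unsigned ways = ((ebx >> 22) & 0x3FF) + 1;   /* EBX bits 31:22, minus-one encoded */
        unsigned sets = ecx + 1;                     /* ECX holds (number of sets) - 1    */
        printf("L1D: %u ways, %u sets\n", ways, sets);
        return 0;
    }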

Raggedy answered 19/1, 2020 at 15:06

Both the optimization manual and the datasheet of the processor family (Section 2.4.2) say that the L1 data cache is 8-way associative. Another source is InstLatx64, which provides cpuid dumps for many processors, including Ice Lake parts. Take, for example, the dump for the i7-1065G7:

CPUID 00000004: 1C004121-02C0003F-0000003F-00000000 [SL 00]

Cache information can be found in cpuid leaf 0x4; the Intel SDM Volume 2 explains how to decode these registers. Bits 31:22 of EBX (the second value from the left) hold the number of ways minus one. Here those bits are 1011 in binary, i.e. 11, so cpuid says there are 11 + 1 = 12 ways. The other fields tell us that the L1 data cache is 48 KB in size with a 64-byte line size and uses the simple addressing scheme. So based on the cpuid information, bits 11:6 of the address form the cache set index.
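
As a sanity check, here is a small C sketch that decodes exactly the register values quoted above, following the leaf-0x4 field layout from the SDM; nothing beyond the numbers in the dump is assumed:

    /* Decode the CPUID leaf 0x4 registers from the i7-1065G7 dump above. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t eax = 0x1C004121, ebx = 0x02C0003F, ecx = 0x0000003F;

        uint32_t level      = (eax >> 5) & 0x7;           /* cache level:        1  */
        uint32_t line_size  = (ebx & 0xFFF) + 1;          /* bits 11:0,  +1  => 64  */
        uint32_t partitions = ((ebx >> 12) & 0x3FF) + 1;  /* bits 21:12, +1  => 1   */
        uint32_t ways       = ((ebx >> 22) & 0x3FF) + 1;  /* bits 31:22, +1  => 12  */
        uint32_t sets       = ecx + 1;                    /* (sets - 1)      => 64  */

        printf("L%u data cache: %u ways x %u sets x %u B lines x %u partitions = %u KB\n",
               level, ways, sets, line_size, partitions,
               ways * sets * line_size * partitions / 1024);
        /* 64 sets and 64-byte lines => the set index is address bits 11:6. */
        return 0;
    }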

So which one is right? The optimization manual could be wrong (it wouldn't be the first time), but the cpuid dump could also be buggy (that wouldn't be a first either). Both could even be wrong, although historically that is much less likely. Other examples of discrepancies between the manual and the cpuid information are discussed here, so we know that errors exist in both sources. Moreover, I'm not aware of any other Intel source that mentions the number of ways in the L1D. Of course, non-Intel sources could be wrong as well.

Having 8 ways with 96 sets would be an unusual design, and such a change would be unlikely to ship with nothing more than a single number in the optimization manual to document it (although that doesn't necessarily mean the cache has to have 12 ways). This by itself makes the manual the more likely of the two to be wrong here.

Fortunately, Intel does document implementation bugs in its processors in the spec update documents. We can check the spec update document for the Ice Lake processors, which you can find here. Two cpuid bugs are documented there:

CPUID TLB Information is Inaccurate

I've already discussed this issue in my answer on Understanding TLB from CPUID results on Intel. The second bug is:

CPUID L2 Cache Information May Be Inaccurate

This is not relevant to your question.

The spec update document mentions some cpuid bugs, yet none concerning the cache information in leaf 0x4, which strongly suggests that this information was validated by Intel and is accurate. So the optimization manual (and the datasheet) is probably the one that is wrong in this case.

Gains answered 20/1, 2020 at 5:46. Comments:
having 8 ways with 96 sets would result in an unusual design - That's a pretty major understatement, isn't it? Intel has always stuck with VIPT = PIPT L1d caches. Even without the CPUID info, I would consider an error in the optimization manual the most likely explanation. Unless you have an implementation technique in mind that allows a non-power-of-2 number of sets and avoids aliasing problems? – Bethlehem
@PeterCordes Intel always makes major changes in each new microarchitecture. In Ice Lake, adding a new store pipe is a huge change. So if Intel has done something in the past, it doesn't mean that it will continue to do it in the future. Yes, there are many implementation techniques that either avoid or deal with the aliasing problems. Regarding the non-power-of-2 set count, there are ways to handle this as well. You could, for example, have a split data cache design where the total number of sets is not a power of 2. – Gains
@PeterCordes Intel will eventually have to deal with these problems anyway. One way would be to increase the size of the smallest page to 64KB, for example, and emulate 4KB pages using 64KB pages. This would avoid the aliasing problems, but it's a trade-off. – Gains
Ok fair point, Intel could have done a major redesign of L1d. But how much larger is it practical to make L1d and still maintain the low load-use latency that pointer-based data structures rely on? Maybe if not for the VIPT problem, they might have gone 64k / 8-way? They do have a private per-core L2 so I don't think they'd want to slow down the common case. I had assumed Intel would keep L1d about this size for as long as they keep making x86 CPUs. (Along with other downsides of a 4k page size, like needing significant space for page tables if you don't use hugepages.) – Bethlehem
@PeterCordes Yes, latency could be an issue, and a split large data cache design can alleviate it. The 4KB page size is not ideal anymore as the smallest page size. Intel has a patent on how to emulate 4KB pages using larger pages. See: #11544248. Removing native support for 4KB pages would help with the VIPT problem and make more bits available for cache indexing, while still maintaining PIPT. – Gains
It's going to be a lot of years before Intel can fully remove 4k page support from mainstream HW. I could imagine them (in several years) selling a CPU where only half the sets in L1d are usable if legacy 4k page support is enabled, so you need an up-to-date OS to get full advantage. (And not running any user-space that requires the OS to let it use 4k pages for mmap.) Like 48k / 12-way vs. 96k / 12-way. I guess tags could include bit 12 to support the 12-bit page-offset mode of operation. – Bethlehem
Very nice answer, well researched! In the end I've accepted Andreas's answer as a matter of personal taste, but yours would also deserve to be accepted. – Neediness
