Skylake L2 cache enhanced by reducing associativity?

In Intel's optimization guide, section 2.1.3, they list a number of enhancements to the caches and memory subsystem in Skylake (emphasis mine):

The cache hierarchy of the Skylake microarchitecture has the following enhancements:

Higher Cache bandwidth compared to previous generations.

Simultaneous handling of more loads and stores enabled by enlarged buffers.

Processor can do two page walks in parallel compared to one in Haswell microarchitecture and earlier generations.

Page split load penalty down from 100 cycles in previous generation to 5 cycles.

L3 write bandwidth increased from 4 cycles per line in previous generation to 2 per line.

Support for the CLFLUSHOPT instruction to flush cache lines and manage memory ordering of flushed data using SFENCE.

Reduced performance penalty for a software prefetch that specifies a NULL pointer.

L2 associativity changed from 8 ways to 4 ways.

The final one caught my eye. In what way is a reduction in the number of ways an enhancement? By itself, it seems that fewer ways is strictly worse than more ways. Of course, I get that there might be valid engineering reasons why a reduction in the number of ways could be a tradeoff that enables other enhancements, but here it is positioned, by itself, as an enhancement.

What am I missing?

It's strictly worse for performance of the L2 cache.
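To make "strictly worse" concrete, here is a toy sketch of conflict misses at equal capacity. It assumes 64-byte lines, true LRU replacement, and simple modulo indexing on the address; real L2 caches index on physical addresses and don't use exact LRU, so treat this as an illustration of the effect, not a model of Skylake. The `count_misses` helper and the access pattern are made up for the example.

```python
from collections import OrderedDict

LINE_BYTES = 64  # cache line size on these CPUs


def count_misses(addresses, capacity_bytes, ways):
    """Count misses in a set-associative cache with true LRU (a simplification)."""
    num_sets = capacity_bytes // (LINE_BYTES * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        index = line % num_sets   # which set this line maps to
        tag = line // num_sets    # identifies the line within that set
        lines_in_set = sets[index]
        if tag in lines_in_set:
            lines_in_set.move_to_end(tag)          # hit: now most recently used
        else:
            misses += 1
            if len(lines_in_set) == ways:
                lines_in_set.popitem(last=False)   # evict least recently used
            lines_in_set[tag] = True
    return misses


CAPACITY = 256 * 1024
# Hypothetical pattern: 8 lines spaced 64 KiB apart, walked 100 times.
pattern = [k * 64 * 1024 for k in range(8)] * 100

misses_8way = count_misses(pattern, CAPACITY, 8)
misses_4way = count_misses(pattern, CAPACITY, 4)
print(misses_8way, misses_4way)  # 8 800
```

All 8 lines alias to the same set in both configurations. With 8 ways they all fit, so only the 8 cold misses occur; with 4 ways, LRU evicts each line just before it is reused, so every one of the 800 accesses misses. This pattern is a worst case picked to show the effect; typical code sees a much smaller difference.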

According to this AnandTech writeup of SKL-SP (aka skylake-avx512 or SKL-X), Intel has stated that "the main reason [for reducing associativity] was to make the design more modular". Skylake-AVX512, by contrast, has 1MiB of L2 cache per core with 16-way associativity.

Presumably the drop to 4-way associativity doesn't hurt too badly in the dual and quad-core laptop and desktop parts (SKL-S), since there's lots of bandwidth to L3 cache. I think if Intel's simulations and testing had found that it hurt a lot, they would have put in the extra design time to keep the 8-way 256k cache on non-AVX512 Skylake.
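For reference, the geometry of these L2 configurations can be worked out from capacity, associativity, and line size. This is just arithmetic, assuming 64-byte lines and power-of-two set counts; `l2_geometry` is a helper name invented for this sketch.

```python
def l2_geometry(capacity_bytes, ways, line_bytes=64):
    """Return (number of sets, set-index bits, aliasing stride in bytes)."""
    num_sets = capacity_bytes // (ways * line_bytes)
    index_bits = num_sets.bit_length() - 1   # log2, valid for power-of-two set counts
    alias_stride = num_sets * line_bytes     # addresses this far apart share a set
    return num_sets, index_bits, alias_stride


configs = [
    ("256 KiB, 8-way (previous generation)", 256 * 1024, 8),
    ("256 KiB, 4-way (Skylake client)", 256 * 1024, 4),
    ("1 MiB, 16-way (Skylake-AVX512)", 1024 * 1024, 16),
]
for name, capacity, ways in configs:
    num_sets, index_bits, stride = l2_geometry(capacity, ways)
    print(f"{name}: {num_sets} sets, {index_bits} index bits, "
          f"aliasing stride {stride // 1024} KiB")
```

This prints 512 sets for the 8-way cache and 1024 sets for both the 4-way client cache and the 16-way 1 MiB cache. Halving the ways at fixed capacity doubles the number of sets, and the client and server L2s end up with the same set count, which is at least consistent with the "more modular" explanation (the bigger cache looks like the smaller one with more ways added), though that is my reading of the numbers and not something Intel states.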

The upside of lower associativity is power budget. It could indirectly help performance by allowing more turbo headroom, but mostly they did it to improve efficiency, NOT to improve speed. Freeing up some room in the power budget allows them to spend it elsewhere. Or not to spend all of it, and just use less power.

Mobile and many-core-server CPUs care a lot about power budget, much more than high-end quad-core desktop CPUs.

The heading on the list should more accurately read "changes", not "enhancements", but I'm sure the marketing department wouldn't let them write anything that didn't sound positive. :P At least Intel documents things accurately and in detail, including the ways new CPUs are worse than older designs.

AnandTech's SKL writeup suggests that dropping the associativity freed up the power budget to increase L2 bandwidth, which (in the big picture) compensates for the increased miss rate.

IIRC, Intel has a policy that any proposed design change must have a 2:1 ratio of perf gain to power cost, or something like that. So presumably if they lose 1% performance but save 3% power with this L2 change, they do it. The 2:1 number might be correct, if I'm remembering this correctly, but the 1% and 3% figures in that example are totally made up.

There was some discussion of this change in one of the podcast interviews David Kanter did right after details were released at IDF. IDK if this is the right link.
