On Skylake (SKL) why are there L2 writebacks in a read-only workload that exceeds the L3 size?
Consider the following simple code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#include <err.h>

int cpu_ms() {
    return (int)(clock() * 1000 / CLOCKS_PER_SEC);
}

int main(int argc, char** argv) {
    if (argc < 2) errx(EXIT_FAILURE, "provide the array size in KB on the command line");

    size_t size = atol(argv[1]) * 1024;
    unsigned char *p = malloc(size);
    if (!p) errx(EXIT_FAILURE, "malloc of %zu bytes failed", size);

    int fill = argc > 2 ? argv[2][0] : 'x';  // optional fill byte from argv[2]
    memset(p, fill, size);

    int startms = cpu_ms();
    printf("allocated %zu bytes at %p and set it to %d in %d ms\n", size, p, fill, startms);

    // wait until 500ms has elapsed from start, so that perf gets the read phase
    while (cpu_ms() - startms < 500) {}
    startms = cpu_ms();

    // we start measuring with perf here
    unsigned char sum = 0;
    for (size_t off = 0; off < 64; off++) {
        for (size_t i = 0; i < size; i += 64) {
            sum += p[i + off];
        }
    }

    int delta = cpu_ms() - startms;
    printf("sum was %u in %d ms \n", sum, delta);

    return EXIT_SUCCESS;
}

This allocates an array of size bytes (passed on the command line, in KiB), sets all bytes to the same value (the memset call), and finally loops over the array in a read-only manner, striding by one cache line (64 bytes), repeating this 64 times so that each byte is accessed exactly once.

If we turn prefetching off¹, we expect this to hit nearly 100% in a given level of cache if size fits in that cache, and to mostly miss at that level otherwise.

I'm interested in two events, l2_lines_out.silent and l2_lines_out.non_silent (and also l2_trans.l2_wb, though its values end up identical to non_silent), which count lines evicted from L2 silently and non-silently, respectively.

If we run this from 16 KiB up through 1 GiB, and measure these two events (plus l2_lines_in.all) for the final loop only, we get:

[Figure: L2 lines in/out]

The y-axis here is the number of events, normalized to the number of accesses in the loop. For example, the 16 KiB test allocates a 16 KiB region, and makes 16,384 accesses to that region, and so a value of 0.5 means that on average 0.5 counts of the given event occurred per access.

The l2_lines_in.all event behaves almost as we'd expect. It starts off around zero, and once the size exceeds the L2 size it rises to 1.0 and stays there: every access brings in a line.

The other two events behave weirdly. In the region where the test fits in the L3 (but not in the L2), the evictions are nearly all silent. However, as soon as the region spills into main memory, the evictions are all non-silent.

What explains this behavior? It's hard to understand why the evictions from L2 would depend on whether the underlying region fits in the L3.

If you do stores instead of loads, almost everything is a non-silent writeback, as expected, since the updated values have to be propagated to the outer caches:

[Figure: stores]

We can also take a look at what level of the cache the accesses are hitting in, using the mem_inst_retired.l1_hit and related events:

[Figure: cache hit ratios]

If you ignore the L1 hit counters, which seem impossibly high at a couple of points (more than 1 L1 hit per access?), the results look more or less as expected: mostly L2 hits when the region fits cleanly in L2, mostly L3 hits for the L3 region (up to 6 MiB on my CPU), and then misses to DRAM thereafter.

You can find the code on GitHub. The details on building and running can be found in the README file.

I observed this behavior on my Skylake client i7-6700HQ CPU. The same effect seems not to exist on Haswell². On Skylake-X, the behavior is totally different, as expected, since the L3 cache design has changed to be something like a victim cache for the L2.


¹ You can do it on recent Intel with wrmsr -a 0x1a4 "$((2#1111))". In fact, the graph is almost exactly the same with prefetch on, so turning it off is mostly just to eliminate a confounding factor.

² See the comments for more details, but briefly: l2_lines_out.(non_)silent doesn't exist there, but l2_lines_out.demand_(clean|dirty) does, which seems to have a similar definition. More importantly, l2_trans.l2_wb, which mostly mirrors non_silent on Skylake, also exists on Haswell, appears to mirror demand_dirty there, and likewise does not exhibit the effect on Haswell.

Ewers answered 29/9/2018 at 5:09, 24 comments
What happens if you store one byte instead of loading one byte per iteration? I don't have a Skylake to use these events and run experiments on. (Smiga)
It might be helpful to run two threads of the same process on two different physical cores, each executing the same loop and accessing the same buffer, and compare these counters per thread and the execution time of each thread (compared to when using a single thread). (Smiga)
@Hadi Haswell has the same counters. (Ewers)
According to the manual, L2_LINES_OUT.SILENT and L2_LINES_OUT.NON_SILENT are only available on Skylake. Or do they have different names on Haswell? They also seem to exist in older microarchitectures, called L2_LINES_OUT.DEMAND_CLEAN and L2_LINES_OUT.DEMAND_DIRTY. (Smiga)
Oh, L2_LINES_OUT.DEMAND_CLEAN and L2_LINES_OUT.DEMAND_DIRTY exist on Haswell by these names. It's just that the umasks are different. (Smiga)
On Haswell, l2_trans.l2_wb and l2_lines_out.non_silent are both almost zero for all array sizes. l2_lines_out.silent becomes flat at around 0.8 per access once the array becomes larger than the L2. l2_lines_in.all is as expected. (Smiga)
@BeeOnRope: Is the data in your question from Skylake-client (inclusive L3), or Skylake-SP (NINE L3)? Is it plausible that clean L2 evictions are written back to L3 like a victim cache would, if they weren't already hot in L3? (Ranket)
@PeterCordes - Skylake client (i7-6700HQ). (Ewers)
@HadiBrais - I added a store test. It behaves "as expected" on Skylake: almost all the lines are evicted non-silently as soon as the test exceeds the L1. New graph at the end of the question. (Ewers)
@PeterCordes - I'm not quite following your thought about "Is it plausible that clean L2 evictions are written-back to L3 like a victim cache would, if they weren't already hot in L3?" - are you talking about the inclusive L3 case? I agree it makes sense to do a WB for the SKX victim L3 case. In SKL though, what does it mean for an unmodified line to be "written back" to L3? Just updating the LRU info in L3? (Ewers)
@HadiBrais - sorry, you are right, those events don't exist by that name on Haswell. I checked and reproduced your result: Haswell doesn't show the same effect if you use demand_dirty and demand_clean in place of non_silent and silent. So either the events are counting something different (i.e., it wasn't simply a name change), or Skylake behaves differently. On Haswell, the l2_trans.l2_wb event does have the same name but doesn't behave in the Skylake way (it is only slightly higher than demand_dirty and much less than clean), which is some evidence that in fact "Skylake is different". (Ewers)
Here's Skylake-X; it looks very different of course, as it has a victim L3 cache and a much larger L2 cache. Not much evidence of silent L2 lines out there at all, which makes sense. (Ewers)
Regarding the store case, is l2_lines_out.non_silent also identical to l2_trans.l2_wb? (Smiga)
@HadiBrais - they are almost indistinguishable on the graph; if you plot them, l2_lines_out.non_silent overlaps l2_trans.l2_wb completely and obscures it. The counts are often identical or off by 1 or 2. The very small differences could be attributable to the non-atomic nature of perf reads of the performance counters. You can see the raw csv results here, columns 2 and 4. (Ewers)
On Haswell, for the store case, l2_trans.l2_wb is about twice as large as L2_LINES_OUT.DEMAND_DIRTY. In both the store and load cases, it appears that the l2_trans.l2_wb events include all the events from L2_LINES_OUT.DEMAND_DIRTY. In addition, l2_trans.l2_wb is exclusive of L2_LINES_OUT.DEMAND_CLEAN. It's not clear to me what L2_LINES_OUT.DEMAND_DIRTY is counting. (Smiga)
Regarding the case where the array fits in the L3 but not the L2, can you show the counts for the L3 hits and misses? We can determine whether the lines are already in the L3 and therefore the L2 is just dropping them because there is no need to write them back. Also for the case where the array does not fit in the L3. In this case, the L2 is writing all the (clean) lines back to the L3 in anticipation that they will be reused. But the L3 placement policy is complicated, so it's hard to predict what will happen without measuring the hit and miss events. (Smiga)
When the array fits in the L3, l2_lines_out.non_silent is not zero. I think this means that initially the lines are brought into the L2 but not the L3. But then the L2 starts writing back lines into the L3 until the L3 experiences a very high hit rate, at which point the L2 stops writing them back. Basically, the L2 dynamically monitors the L3 hit rate: if it's low, it keeps writing back lines, whether clean or dirty, in an attempt to increase the L3 hit rate; if it's high, it stops writing back clean lines. This would explain the graph. (Smiga)
@HadiBrais - here's what I get for Haswell wb vs demand_dirty. Definitely different, but not 2x. I don't know what causes the difference. This is with PF on; I'm checking PF off now. (Ewers)
@HadiBrais - here's the same graph with prefetch off. The two series line up pretty much exactly (and there are some other effects). So I think we can say the difference between the two on Haswell may be related to prefetching - this makes sense given the "demand" part of the event name: maybe the "demand" event only counts lines out due to demand requests, while the wb event also counts lines evicted by incoming PFs. (Ewers)
@HadiBrais - about "In this case, the L2 is writing all the (clean) lines back to the L3 in anticipation that they will be reused. But the L3 placement policy is complicated and so it's hard to predict what will happen without measuring the hit and miss events." - but the L3 is inclusive, so why would the L2 "write back" a clean line to the L3? It knows with 100% certainty that the same line is already there. (Ewers)
@HadiBrais - I added FB, L1/2/3 mem_inst_retired.*_hit graphs, which is maybe what you were asking for regarding hits and misses. The counts look "as expected", except that the L1 hits are weird (but I think we can ignore that). (Ewers)
@BeeOnRope: My thought was that if your data was from SKX, the non-inclusive nature of the L3 could maybe explain write-backs. (SKX L3 isn't a victim cache, though. AFAIK, a cache-miss load populates all 3 levels of cache. I was suggesting that it could also have the victim-cache-like behaviour of L2 eviction sending the line to L3). But since your data is from SKL, my guess doesn't apply at all. I don't think it makes sense to bump the L3 LRU on L2 eviction. (Ranket)
I don't know if the SKX cache is a victim cache or not, but it has been widely described as such, e.g., by wikichip. Anyway, I did collect data from SKX as above, and it is pretty much 100% writebacks in the L2 case, which makes sense with a "victim" or "victim-like" L3 cache. IMO it seems unlikely that loads always populate both L2 and L3: the L3 size is barely larger than the L2, so that would be a lot of duplication (but a lot depends on whether L2 tags live in L3 to implement snoop filtering, etc...). (Ewers)
No answer, but this could possibly be related to why zero-over-zero store elimination falls off as the region size gets larger as well. (Relax)
