Why does the number of processor cycles required to process a single array element grow with the working set (array) size?

I've been reading through the "What Every Programmer Should Know About Memory" paper and got confused by the measurements performed on pages 20-21 of the document.

Sequential Read Access

I don't really understand the meaning of the cycles-per-array-element measurements. For example, the paper says that the L1d cache size is 16kB on the machine that ran the tests, so when the working set size is below 2^14 B, that is 16kB, the number of cycles to process a single array element is around 4.

Now, how do I interpret that? I can assume that since the entire working set fits into the L1d cache, there is no need to fetch data from the L2 cache, so the access time for every single element is 4 cycles, which is expected.
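For reference, this is roughly how I picture the structure being measured (my own sketch of the paper's NPAD = 0 sequential-read setup, not the paper's actual code; the names are mine):

```c
#include <stddef.h>
#include <stdlib.h>

/* NPAD = 0 case from the paper: one list element is exactly one pointer wide. */
struct elem {
    struct elem *next;
};

/* Lay out n elements contiguously, each pointing to the next one and the last
   wrapping around to the first, so a "sequential read" is simply chasing the
   next pointer through the whole working set over and over again. */
static struct elem *make_working_set(size_t n)
{
    struct elem *a = malloc(n * sizeof *a);
    for (size_t i = 0; i < n; i++)
        a[i].next = &a[(i + 1) % n];
    return a;
}

/* Perform the given number of element accesses by pointer chasing. */
static struct elem *traverse(struct elem *p, size_t accesses)
{
    while (accesses--)
        p = p->next;
    return p;   /* return the pointer so the compiler cannot delete the loop */
}
```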

What I don't understand is why, when the working set size becomes larger than the L1d cache (for example, 2^15 B, that is 32kB), the number of cycles to process a single element grows from 4 to 9.

Still, I would assume that the part of the array that fits into the L1d cache will be accessed from there, and the rest will be served from L2. So for some elements the number of cycles needed will still be 4, and for those in the L2 cache this number will grow.

Is that assumption incorrect, and will all of the array elements be accessed from L2 since the working set doesn't fit into L1? Otherwise, we can't say that it takes 9 cycles to process every single array element, since it is 4 for the part of the array in L1d and 9 for the rest of the array in L2.

Also, the first element will obviously take longer to process, since the processor has to load the cache line containing it from main memory in the first place. Is it some kind of average processing time that is depicted in the paper?
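This is roughly what I imagine such an averaged measurement looks like (a hedged sketch building on the code above, not the paper's actual harness; `__rdtsc` is a GCC/Clang x86 intrinsic, and I am assuming the reported number is total time divided by total accesses):

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang, x86 only */

static volatile struct elem *sink;   /* keeps the traversal from being optimized out */

/* Average cycles per element access: time many full traversals and divide by
   the total number of accesses, so the cold misses of the very first pass
   (and occasional interference) are amortized away in the reported number.
   Note that on modern CPUs the TSC ticks at a fixed reference frequency, so
   this only equals core clock cycles on older machines like the one used in
   the paper (or when the core happens to run at that reference frequency). */
static double cycles_per_element(struct elem *head, size_t n, size_t iters)
{
    uint64_t start = __rdtsc();
    sink = traverse(head, n * iters);
    uint64_t stop = __rdtsc();
    return (double)(stop - start) / (double)(n * iters);
}
```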

Moreover, there are most likely other programs running on the system which also occupy space in the L1d/L2 caches, so it is probably wrong to assume that the entire 16kB of the L1d cache is available to the program running the test.

With all these concerns in mind, how do I interpret the results of this test?

Asked by Yettie on 25/8, 2023 at 12:22 (11 comments)
"some part of the array, that fits into L1d cache" This is a good point. AFAIK, yes, it can, and it does on some modern processors. The way the CPU behave is strongly dependent of the algorithm use to replace cache lines. If the algorithm is not very clever, it will evict most cache lines so none can be reused and you pay the latency/throughput of the L2 cache. With a good algorithm, you get more smooth results because there are still L1 cache lines that can be reused. That being said at 2^16, the number of cache lines reused become negligible resulting in a similar behavior.Iceland
"other programs running on the system" Yes. The behavior depends of the OS and the running applications. Some OS flush the L1 cache (and AFAIK the TLB) when a new application is scheduled for security reasons, which is a disaster for performance since it causes a lot of cache misses when your application is running back. That being said OS also tries not to cause a context switch of a running application if another core is available. The quantum is sufficiently big for the scheduling effects not to be strongly visible on L1 statistics.Iceland
Besides, if the L1 is completely flushed, the statistics are independent of the amount of L1 cache required by other applications. The only things that matter are the number of context switches and the time between them, as well as the behavior of your application. Also please note that things get more complex with SMT threads (aka Hyper-Threading), since 2 applications can run on the same core and share the same L1 cache. The effect of the other applications running on the same core is visible and can be used to forge side-channel attacks. – Iceland
@JérômeRichard but in this particular benchmark a simple sequential access is made, hence no cache line is really reused once all of the elements it holds have been processed. Since the access is sequential, there is no way the current cache line will be evicted from the cache before all of its elements are processed. – Yettie
@JérômeRichard: L1d and L2 replacement use some form of pseudo-LRU in most CPUs, including Intel and AMD, not spending as many metadata bits and as much complexity as it would take to do full LRU for 8-way or 12-way associativity. L3 replacement in Intel CPUs uses an adaptive policy that tries to reduce pollution from accesses with low temporal locality that alias the same set as lines that do have long-term value. blog.stuffedcow.net/2013/01/ivb-cache-replacement – Huntsville
@PeterCordes AFAIR, Intel uses an LFU cache replacement policy on relatively recent CPUs (certainly with some customization), not just a pseudo-LRU. Though they are quite close, a frequency-based policy can have a strong impact on benchmarks with a non-uniform memory access distribution (certainly not on the OP's, though). I was not aware of the adaptive policy for the L3. This strategy makes sense (and matches well with benchmarks). The article you link is a bit old now (10 years old), so I guess things should have changed since Ivy Bridge (especially on Skylake or Ice Lake). – Iceland
I was curious, so I did some (simple, quick) benchmarks on my i5-9600KF processor (Coffee Lake). I used a loop accessing scalar 1-byte values with a stride of 64, performed N times. I get a sharp increase between 32 KiB and 36 KiB (about 1.75x slower). This matches the specification of Coffee Lake: 32 KiB of L1 cache and half the throughput for the L2. Strangely, between 36 and 80 KiB the timings then become slightly faster (10%) and then slower again, to finally mostly stabilize at 64 KiB and completely at 80 KiB (with timings 2x slower for the L2 compared to the L1). – Iceland
For the L2, results show a far smoother increase. They start to be quite unstable at 192 KiB. The beginning of the increase is visible at 240 KiB and its end is at ~300 KiB. The increase is almost linear (no sharp jump) and timings increase by a factor of 1.5x over this interval. Past ~300 KiB, the timings are a bit better and then increase non-linearly (unstable). They reach a maximum at 512 KiB and then become very stable even up to 4096 KiB (1.75x slower than for 128 KiB). The results match the L2 size of 256 KiB. – Iceland
The timings in the L3 are a bit surprising, since I expected a 2x rather than 1.75x, but it is close to that and my benchmark is maybe not perfect... The increase between the L3 and memory is not sharp at all. This is consistent with many other benchmarks. It is certainly due to a more advanced replacement policy algorithm, matching the comment of PeterCordes. – Iceland
I wonder if prefetchers could be partially responsible for the non-sharp increases (especially for the L2/L3). Prefetching should cause more misses because more data is loaded than is useful. The target cache lines need to be reloaded, impacting the cache replacement policy. I think these cache lines are more likely to stay in the cache (especially with an LFU policy) so they can be reused at least once. This would explain why the increase starts significantly before the cache-size threshold (and likewise ends after it). The higher the latency, the larger the amount of prefetched cache lines, and the smoother the increase. – Iceland
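For reference, a benchmark of the shape described in the comments above could look roughly like this (a hedged sketch only; the buffer size, repetition count, and the volatile qualifier are assumptions, not necessarily what was actually run, and the timing around the call is omitted):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read one byte per 64-byte cache line, repeatedly, over a buffer of `size`
   bytes. Plotting the time per access against `size` should reveal the
   L1/L2/L3 capacity steps discussed above. */
static uint8_t stride64_pass(const volatile uint8_t *buf, size_t size, size_t reps)
{
    uint8_t sum = 0;
    for (size_t r = 0; r < reps; r++)
        for (size_t i = 0; i < size; i += 64)   /* one scalar load per line */
            sum += buf[i];
    return sum;   /* use the loads' result so they are not optimized away */
}

int main(void)
{
    size_t size = 32 * 1024;    /* e.g. exactly the 32 KiB L1d of Coffee Lake */
    uint8_t *buf = calloc(size, 1);
    uint8_t checksum = stride64_pass(buf, size, 100000);
    printf("%u\n", (unsigned)checksum);
    free(buf);
    return 0;
}
```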
@JérômeRichard: Interesting. Just for the record, recall that the graph the OP is asking about is from Ulrich Drepper's article from Nov 2007, when CPUs were simpler (either P4 or Core 2: How much of 'What Every Programmer Should Know About Memory' is still valid?). A good answer to this question would also discuss if/how more recent CPUs differ. – Huntsville
