I've been reading through the "What Every Programmer Should Know About Memory" paper and got confused by the measurements performed on pages 20-21 of the document.
I don't really understand the meaning of the cycles-per-array-element measurements. For example, the paper says that the L1d cache size is 16kB on the machine that ran the tests, so as long as the working set size is at most 2^14 B (that is, 16kB), the number of cycles needed to process a single array element is around 4.
Now, how do I interpret that? I can assume that since the entire working set fits into the L1d cache, there is no need to fetch data from the L2 cache, so the access time for every single element is 4 cycles, which is expected.
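To make my mental model concrete, here is the kind of measurement loop I imagine being timed (my own simplified sketch in C, not the paper's actual code, which walks a padded list structure; names like WORKING_SET_SIZE and traverse are mine):

    #include <stddef.h>

    #define WORKING_SET_SIZE (1 << 14)          /* 16kB, fits the 16kB L1d  */
    #define N (WORKING_SET_SIZE / sizeof(long)) /* number of array elements */

    static long arr[N];

    long traverse(void)
    {
        long sum = 0;
        /* Walk the whole working set sequentially; "cycles per element"
           would then be (total cycles for the walk) / N. */
        for (size_t i = 0; i < N; ++i)
            sum += arr[i];
        return sum;
    }

For the 2^15 B case I would just change WORKING_SET_SIZE to (1 << 15) and rerun, and that is where my confusion starts.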
The thing I don't understand is why, when the working set size becomes larger than the L1d cache (for example 2^15 B, that is 32kB), the number of cycles needed to process a single element grows from 4 to 9.
Still, I would assume that the part of the array that fits into the L1d cache is accessed from there, while the rest sits in L2. So for some elements the number of cycles needed should still be 4, and only for those served from L2 should it grow.
Is that assumption incorrect, and are all of the array elements accessed from L2 because the working set doesn't fit into L1d? Otherwise we can't say that it takes 9 cycles to process every single array element, since it would be 4 for the part of the array in L1d and 9 for the rest of it in L2.
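To put that into numbers (my own back-of-the-envelope arithmetic, not something from the paper): if a fraction f of the elements were still served from L1d at 4 cycles and the rest from L2 at 9 cycles, I would expect the reported value to be the weighted average

    cycles per element ≈ f * 4 + (1 - f) * 9

    e.g. with half of the 32kB working set still in L1d (f = 0.5):
    0.5 * 4 + 0.5 * 9 = 6.5

which is noticeably less than the 9 cycles the plot actually shows for a 2^15 B working set.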
Also, the first element will obviously take longer to process, since the processor first has to load the cache line containing it from main memory. Is it some kind of average processing time that is depicted in the paper?
Moreover, there are most likely other programs running on the system that also occupy space in the L1d/L2 caches, so it is probably wrong to assume that the entire 16kB of the L1d cache will be available to the program running the test.
With all these concerns in mind, how do I interpret the results of this test?