What is the Cost of an L1 Cache Miss?
P

8

76

Edit: For reference purposes (if anyone stumbles across this question), Igor Ostrovsky wrote a great post about cache misses. It discusses several different issues and shows example numbers. End Edit

I did some testing <long story goes here> and am wondering if a performance difference is due to memory cache misses. The code below boils the issue down to the critical timing portion: a pair of loops that visit memory first in ascending address order and then in random order.

I ran it on an XP machine (compiled with VS2005: cl /O2) and on a Linux box (gcc -Os). Both produced similar times. These times are in milliseconds. I believe all loops are running and are not optimized out (otherwise it would run “instantly”).

*** Testing 20000 nodes
Total Ordered Time: 888.822899
Total Random Time: 2155.846268

Do these numbers make sense? Is the difference primarily due to L1 cache misses or is something else going on as well? Each test makes 20,000^2 memory accesses, and if every access in the random test were a cache miss, the extra time works out to about 3.2 nanoseconds per miss. The XP (P4) machine I tested on is 3.2GHz and I suspect (but don’t know) has a 32KB L1 cache and 512KB L2. With 20,000 entries (80KB), I assume there is not a significant number of L2 misses. So this would be (3.2*10^9 cycles/second) * (3.2*10^-9 seconds/miss) = about 10 cycles/miss. That seems high to me. Maybe it’s not, or maybe my math is bad. I tried measuring cache misses with VTune, but I got a BSOD. And now I can’t get it to connect to the license server (grrrr).
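For anyone who wants to re-check the arithmetic, here is a minimal sketch in C. It assumes the 3.2 ns figure comes from the difference between the two totals divided by the number of accesses; the function names are made up for illustration.

```c
#include <assert.h>

/* Extra time per access, in nanoseconds, when going from the ordered
   traversal to the random one. Both totals are in milliseconds. */
double extra_ns_per_access(double ordered_ms, double random_ms, double accesses)
{
    assert(accesses > 0.0);
    return (random_ms - ordered_ms) * 1.0e6 / accesses;  /* ms -> ns */
}

/* Convert nanoseconds to clock cycles at the given clock rate in GHz
   (1 GHz is one cycle per nanosecond). */
double ns_to_cycles(double ns, double clock_ghz)
{
    return ns * clock_ghz;
}
```

Calling extra_ns_per_access( 888.822899, 2155.846268, 20000.0 * 20000.0 ) and passing the result to ns_to_cycles( ..., 3.2 ) reproduces the figures above: a bit under 3.2 ns, or roughly 10 cycles, per access.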

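#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#if defined( WIN32 )
#include <windows.h>         // LONGLONG, QueryPerformanceCounter
#else
typedef long long LONGLONG;  // not defined outside the Windows headers
#endif

typedef struct stItem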
{
   long     lData;
   //char     acPad[20];
} LIST_NODE;



#if defined( WIN32 )
void StartTimer( LONGLONG *pt1 )
{
   QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}

void StopTimer( LONGLONG t1, double *pdMS )
{
   LONGLONG t2, llFreq;

   QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
   QueryPerformanceFrequency( (LARGE_INTEGER*)&llFreq );
   *pdMS = ((double)( t2 - t1 ) / (double)llFreq) * 1000.0;
}
#else
// doesn't need 64-bit integer in this case
void StartTimer( LONGLONG *pt1 )
{
   // Just use clock(), this test doesn't need higher resolution
   *pt1 = clock();
}

void StopTimer( LONGLONG t1, double *pdMS )
{
   LONGLONG t2 = clock();
   *pdMS = (double)( t2 - t1 ) * 1000.0 / (double)CLOCKS_PER_SEC;
}
#endif



long longrand()
{
   #if defined( WIN32 )
   // Stupid cheesy way to make sure it is not just a 16-bit rand value
   return ( rand() << 16 ) | rand();
   #else
   return rand();
   #endif
}

// get random value in the given range
int randint( int m, int n )
{
   int ret = longrand() % ( n - m + 1 );
   return ret + m;
}

// I think I got this out of Programming Pearls (Bentley).
void ShuffleArray
(
   long *plShuffle,  // (O) return array of "randomly" ordered integers
   long lNumItems    // (I) length of array
)
{
   long i;
   long j;
   long t;

   for ( i = 0; i < lNumItems; i++ )
      plShuffle[i] = i;

   for ( i = 0; i < lNumItems; i++ )
      {
      j = randint( i, lNumItems - 1 );

      t = plShuffle[i];
      plShuffle[i] = plShuffle[j];
      plShuffle[j] = t;
      }
}



int main( int argc, char* argv[] )
{
   long          *plDataValues;
   LIST_NODE     *pstNodes;
   long          lNumItems = 20000;
   long          i, j;
   LONGLONG      t1;  // for timing
   double dms;

   if ( argc > 1 && atoi(argv[1]) > 0 )
      lNumItems = atoi( argv[1] );

   printf( "\n\n*** Testing %ld nodes\n", lNumItems );

   srand( (unsigned int)time( 0 ));

   // allocate the nodes as one single chunk of memory
   pstNodes = (LIST_NODE*)malloc( lNumItems * sizeof( LIST_NODE ));
   assert( pstNodes != NULL );

   // Create an array that gives the access order for the nodes
   plDataValues = (long*)malloc( lNumItems * sizeof( long ));
   assert( plDataValues != NULL );

   // Access the data in order
   for ( i = 0; i < lNumItems; i++ )
      plDataValues[i] = i;

   StartTimer( &t1 );

   // Loop through and access the memory a bunch of times
   for ( j = 0; j < lNumItems; j++ )
      {
      for ( i = 0; i < lNumItems; i++ )
         {
         pstNodes[plDataValues[i]].lData = i * j;
         }
      }

   StopTimer( t1, &dms );
   printf( "Total Ordered Time: %f\n", dms );

   // now access the array positions in a "random" order
   ShuffleArray( plDataValues, lNumItems );

   StartTimer( &t1 );

   for ( j = 0; j < lNumItems; j++ )
      {
      for ( i = 0; i < lNumItems; i++ )
         {
         pstNodes[plDataValues[i]].lData = i * j;
         }
      }

   StopTimer( t1, &dms );
   printf( "Total Random Time: %f\n", dms );

   free( plDataValues );
   free( pstNodes );
   return 0;
}
Portillo answered 14/7, 2009 at 16:25 Comment(4)
His question is: "Do these numbers make sense?" - Henbit
Sorry - I kind of buried the question in too much text. But yes, the question is whether the numbers make sense. Are 10 cycles for an L1 cache miss about right? - Portillo
You should read "What Every Programmer Should Know About Memory" by Ulrich Drepper; it goes deep into the timing of memory access and into how access patterns interact with the cache. - Akerley
The Igor Ostrovsky post linked to in the question is excellent. +1 just for directing me to that. - Val
M
27

While I can't offer an answer to whether or not the numbers make sense (I'm not well versed in the cache latencies, but for the record ~10 cycle L1 cache misses sounds about right), I can offer you Cachegrind as a tool to help you actually see the differences in cache performance between your 2 tests.

Cachegrind is a Valgrind tool (the framework that powers the always-lovely memcheck) which profiles cache and branch hits/misses. It will give you an idea of how many cache hits/misses you are actually getting in your program.

Moue answered 15/7, 2009 at 8:28 Comment(1)
Very nice. Thanks for the pointer to it. I've been aware of Valgrind but haven't used it before (most of my development is on Win32). I just now ran it on a Linux box and it reported a 41% miss rate for the "random" portion of the test. And the "in order" portion of the test had a negligible miss rate. Neither portion had any L2 miss rate to speak of. - Portillo
E
74

Here is an attempt to provide insight into the relative cost of cache misses by analogy with baking chocolate chip cookies ...

Your hands are your registers. It takes you 1 second to drop chocolate chips into the dough.

The kitchen counter is your L1 cache, twelve times slower than registers. It takes 12 x 1 = 12 seconds to step to the counter, pick up the bag of walnuts, and empty some into your hand.

The fridge is your L2 cache, four times slower than L1. It takes 4 x 12 = 48 seconds to walk to the fridge, open it, move last night's leftovers out of the way, take out a carton of eggs, open the carton, put 3 eggs on the counter, and put the carton back in the fridge.

The cupboard is your L3 cache, three times slower than L2. It takes 3 x 48 = 2 minutes and 24 seconds to take three steps to the cupboard, bend down, open the door, root around to find the baking supply tin, extract it from the cupboard, open it, dig to find the baking powder, put it on the counter and sweep up the mess you spilled on the floor.

And main memory? That's the corner store, 5 times slower than L3. It takes 5 x 2:24 = 12 minutes to find your wallet, put on your shoes and jacket, dash down the street, grab a litre of milk, dash home, take off your shoes and jacket, and get back to the kitchen.

Note that all these accesses are constant complexity -- O(1) -- but the differences between them can have a huge impact on performance. Optimizing purely for big-O complexity is like deciding whether to add chocolate chips to the batter 1 at a time or 10 at a time, but forgetting to put them on your grocery list.

Moral of the story: Organize your memory accesses so the CPU has to go for groceries as rarely as possible.

Numbers were taken from the CPU Cache Flushing Fallacy blog post, which gives the following figures for a particular 2012-era Intel processor (the multipliers are rounded):

  • register access = 4 instructions per cycle
  • L1 latency = 3 cycles (12 x register)
  • L2 latency = 12 cycles (4 x L1, 48 x register)
  • L3 latency = 38 cycles (3 x L2, 12 x L1, 144 x register)
  • DRAM latency = 65 ns = 195 cycles on a 3 GHz CPU (5 x L3, 15 x L2, 60 x L1, 720 x register)
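To see how the kitchen times fall out of the cycle counts, here is a small sketch. It assumes the scaling used in the analogy, namely that one register-speed instruction (a quarter of a cycle at 4 instructions per cycle) counts as one second; the function name is invented for illustration.

```c
#include <assert.h>

/* Scale a latency in cycles to "kitchen seconds", where one
   register-speed instruction takes 1 second. At 4 instructions per
   cycle, one cycle is worth 4 kitchen seconds. */
double kitchen_seconds(double latency_cycles, double instr_per_cycle)
{
    assert(instr_per_cycle > 0.0);
    return latency_cycles * instr_per_cycle;
}
```

With 4 instructions per cycle this gives 12 s for L1 (3 cycles) and 48 s for L2 (12 cycles), matching the story. For L3 (38 cycles) and DRAM (195 cycles) it gives 152 s and 780 s, slightly more than the 144 s and 720 s in the analogy, because the analogy multiplies rounded ratios together.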

The Gallery of Processor Cache Effects also makes good reading on this topic.

Mmmm, cookies ...

Euphrosyne answered 21/3, 2015 at 21:56 Comment(3)
That 1 in O(1) is always a drag. Nice answer, should have been the accepted one! - Plainclothesman
Great answer! Additionally, this could be extended to multiple kitchenettes (cores) that share the same cupboards (L3 cache); if one cook goes to the store for more flour, all the others can grab it from there. - Skip
I would also add: in the case of virtual memory, an access to a swapped page (i.e. one that requires data to be read in from disk) is like finding the store is out of stock for that cinnamon powder, and they need to order a new batch in from China - with a 6-week shipping period. - Moustache
N
18

3.2ns for an L1 cache miss is entirely plausible. For comparison, on one particular modern multicore PowerPC CPU, an L1 miss is about 40 cycles -- a little longer for some cores than others, depending on how far they are from the L2 cache (yes really). An L2 miss is at least 600 cycles.

Cache is everything in performance; CPUs are so much faster than memory now that you're really almost optimizing for the memory bus instead of the core.

Norling answered 15/7, 2009 at 8:27 Comment(0)
H
6

Well yeah that does look like it will mainly be L1 cache misses.

10 cycles for an L1 cache miss does sound about reasonable, probably a little on the low side.

A read from RAM is going to take on the order of hundreds or maybe even thousands of cycles (am too tired to attempt the maths right now ;)), so it's still a huge win over that.

Hypercorrect answered 14/7, 2009 at 16:30 Comment(5)
"a little on the low side" - with 80K of data and 32K of L1, you'd be disappointed if every fetch missed cache, so a little low makes sense to me.Recluse
Good point... and the fact that the order has been randomised means that there must be about 50/50 cache misses to hits. Of course it'd be nice and easy to come up with a read pattern that would mean every access missed :) - Hypercorrect
I agree - good point. If the cache is 32K and it is largely dedicated to holding the array, then maybe 40% of the references would be hits. So a 60% miss rate would take the cost up to about 17 cycles per miss (again assuming my math is correct). - Portillo
sandpile.org/impl/p4.htm suggests that the latency for an L2 cache read on a 90 to 65nm P4 is between 18 and 20 cycles. So Mark's quick calculation above appears pretty spot on :) - Hypercorrect
In fact, assuming 18 cycles per miss and plugging that in gives us a value of around 56.3% L1 cache misses, and assuming 20 cycles gives us a value of 50.6% L1 cache misses. - Hypercorrect
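The back-and-forth in these comments boils down to one relation: average extra cycles per access = miss rate x cycles per miss. A minimal sketch, assuming hits cost nothing extra and using made-up function names:

```c
#include <assert.h>

/* Cost of a single miss, given the average extra cycles per access
   and the fraction of accesses that miss (0 < miss_rate <= 1). */
double cycles_per_miss(double avg_extra_cycles, double miss_rate)
{
    assert(miss_rate > 0.0 && miss_rate <= 1.0);
    return avg_extra_cycles / miss_rate;
}

/* Miss rate implied by the average extra cycles per access and an
   assumed per-miss penalty (for example, the L2 read latency). */
double implied_miss_rate(double avg_extra_cycles, double miss_penalty_cycles)
{
    assert(miss_penalty_cycles > 0.0);
    return avg_extra_cycles / miss_penalty_cycles;
}
```

With about 10.1 extra cycles per access, cycles_per_miss( 10.1, 0.6 ) comes out near 17 cycles, and implied_miss_rate with penalties of 18 and 20 cycles lands at roughly 56% and 51%, in line with the comments above.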
S
4

If you plan on using cachegrind, please note that it is a cache hit/miss simulator only. It won't always be accurate. For example: if you access some memory location, say 0x1234, in a loop 1000 times, cachegrind will always show you that there was only one cache miss (the first access), even if you have something like:

clflush 0x1234

in your loop.

On x86, this will cause all 1000 accesses to miss the cache.
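A rough sketch of such a loop in C, for anyone who wants to try it. It uses the _mm_clflush and _mm_mfence intrinsics from <emmintrin.h>; the function name, the HAVE_CLFLUSH macro and the fallback for non-x86 builds are my own additions, not part of the answer.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>              /* _mm_clflush, _mm_mfence */
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0              /* hypothetical fallback: plain reads only */
#endif

/* Read *p `reps` times, evicting its cache line before each read when
   clflush is available. Returns the sum of the values read. */
long read_with_flush(const volatile long *p, size_t reps)
{
    long sum = 0;
    size_t k;

    assert(p != NULL);
    for (k = 0; k < reps; k++) {
#if HAVE_CLFLUSH
        _mm_clflush((const void *)p);  /* evict the line holding *p */
        _mm_mfence();                  /* let the flush complete before the load */
#endif
        sum += *p;                     /* this load typically has to refetch the line */
    }
    return sum;
}
```

On real x86 hardware each of the loads should go back to memory, while a simulator that does not model clflush would count one miss followed by hits.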

Silicic answered 14/11, 2011 at 23:18 Comment(2)
Could you please explain why it would take 1000 cache misses on x86? - Derwent
If this is true, could cachegrind not simply add support for the clflush instruction to their cache simulation? - Arlberg
D
2

Some numbers for a 3.4GHz P4 from a Lavalys Everest run:

  • the L1 dcache is 8K (cacheline 64 bytes)
  • L2 is 512K
  • L1 fetch latency is 2 cycles
  • L2 fetch latency is about double what you are seeing: 20 cycles

More here: http://www.freeweb.hu/instlatx64/GenuineIntel0000F25_P4_Gallatin_MemLatX86.txt

(for the latencies look at the bottom of the page)

Diastase answered 15/7, 2009 at 14:6 Comment(0)
H
0

It's difficult to say anything for sure without a lot more testing, but in my experience that scale of difference definitely can be attributed to the CPU L1 and/or L2 cache, especially in a scenario with randomized access. You could probably make it even worse by ensuring that each access is at least some minimum distance from the last.
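One simple way to enforce a minimum distance is a strided visiting order, sketched below. This is only an illustration with an invented function name, and it is not from the question's code. Note the caveat: a constant stride is exactly what hardware prefetchers look for, so on some CPUs this pattern may end up faster than the shuffled one.

```c
#include <assert.h>
#include <stddef.h>

/* Fill order[0..n-1] with every index in 0..n-1 exactly once, so that
   consecutive entries within a pass are exactly `stride` apart.
   Pass 0 visits 0, stride, 2*stride, ...; pass 1 starts at index 1. */
void strided_order(long *order, size_t n, size_t stride)
{
    size_t start, k, pos = 0;

    assert(order != NULL && stride > 0);
    for (start = 0; start < stride && start < n; start++)
        for (k = start; k < n; k += stride)
            order[pos++] = (long)k;
    assert(pos == n);  /* every index was emitted once */
}
```

In the question's program this could fill plDataValues in place of ShuffleArray. Assuming 64-byte cache lines and a 4-byte LIST_NODE, a stride of 16 puts each access on a different line from the previous one.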

Henbit answered 14/7, 2009 at 16:31 Comment(0)
M
-3

The easiest thing to do is to take a scaled photograph of the target cpu and physically measure the distance between the core and the level-1 cache. Multiply that distance by the distance electrons can travel per second in copper. Then figure out how many clock-cycles you can have in that same time. That's the minimum number of cpu cycles you'll waste on a L1 cache miss.

You can also work out the minimum cost of fetching data from RAM in terms of the number of CPU cycles wasted in the same way. You might be amazed.

Notice that what you're seeing here definitely has something to do with cache misses (be it L1 or both L1 and L2), because the cache normally pulls in the whole cache line once you access anything on it, so sequential access requires fewer trips to RAM.

However, what you're probably also seeing is the fact that RAM (even though it's called Random Access Memory) still prefers linear memory access.

Monoceros answered 15/7, 2009 at 8:16 Comment(5)
<pedant> The speed of an electron does not relate to the speed of the current / voltage. Electrons move really slowly. </pedant> - Improvisation
Yeah, it's more to do with capacitance and how long the ringing takes to settle down. - Norling
@Skizz, could you show me how to convert those units into seconds so I can work that into the answer? - Monoceros
The very least you could do is include the speed of an electrical wave in copper, which is IIRC about 0.6c (close enough for this purpose). - Yellowknife
This would make sense if cache accesses were clock-less asynchronous circuits. Real processors are pipelined and changes only happen at clock edges, and load/store pipelines have pipeline registers that operate deterministically. The physical distances are relevant only to the engineers that are designing the silicon. One reason for the latencies is physical distance, sure, but you cannot determine latencies from die photographs. - Ostosis
