Linux C++: how to profile time wasted due to cache misses?

I know that I can use gprof to benchmark my code.

However, I have this problem -- I have a smart pointer that has an extra level of indirection (think of it as a proxy object).

As a result, I have this extra layer that affects pretty much all functions and screws with caching.

Is there a way to measure the time my CPU wastes due to cache misses?

Oxley answered 21/3, 2010 at 11:18 Comment(0)

Linux supports this with perf from kernel 2.6.31 on. This allows you to do the following:

  • compile your code with -g to have debug information included
  • run your code, e.g. using the last-level cache miss counters: perf record -e LLC-loads,LLC-load-misses yourExecutable (see the combined sketch after this list)
  • run perf report
    • after acknowledging the initial message, select the LLC-load-misses line,
    • then e.g. the first function and
    • then annotate. You should see the lines (in assembly code, surrounded by the original source code) and, for the lines where cache misses occurred, a number indicating the fraction of last-level cache misses attributed to each line.
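Putting the steps together, a minimal session looks roughly like this (the source and binary names are placeholders, and the event names vary by CPU, so check perf list on your machine first):

    g++ -O2 -g myprog.cpp -o myprog                        # keep -g so perf can map samples to source
    perf record -e LLC-loads,LLC-load-misses ./myprog
    perf report                                            # then drill down as described above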
Impotence answered 12/7, 2013 at 10:37 Comment(3)
Where can one find a list of events such as this?Lareelareena
perf list prints the list of available events. The list of events depends on the machine you are running on. On a virtual machine I logged on I get 10 software events and two 'Raw hardware event descriptor' events. On a physical machine I just logged on (Xeon E5), I get 26 'Hardware cache event' types, 10 'Hardware event' types, 28 'Kernel PMU event' types, 10 'Software event' types, two 'Raw hardware event descriptor' events and one 'Hardware breakpoint' type.Impotence
This is useful. I tried to follow your steps, and in the last step I kind of lost track because it showed the asm line that calls a templatized function. I wonder how I can narrow down further (i.e. which asm line inside that templatized function generated most of the LLC misses). When I tried to hit "enter" to "step into" the function, it said, "The called function was not found". Any suggestion? My feeling is that all the lines inside that templatized function are inlined.Upthrust

You could try cachegrind and its front-end kcachegrind.
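For example, a minimal run looks like this (cachegrind writes its results to a cachegrind.out.<pid> file, which both tools below read):

    valgrind --tool=cachegrind ./yourExecutable
    cg_annotate cachegrind.out.<pid>     # per-function (and per-line) miss counts
    kcachegrind cachegrind.out.<pid>     # or browse the same file graphically

Note it is a simulation, so the run is much slower than native, but the counts are exactly attributable to source lines.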

Territoriality answered 21/3, 2010 at 11:53 Comment(0)

You could find a tool that accesses the CPU performance counters. There is probably a register in each core that counts L1, L2, etc. misses. Alternatively, Cachegrind performs a cycle-by-cycle simulation.

However, I don't think that would be insightful. Your proxy objects are presumably modified by their own methods. A conventional profiler will tell you how much time those methods are taking. No profiling tool would tell you how performance would improve without that source of cache pollution; that would require shrinking and restructuring the program's working set, which isn't easy to extrapolate.

A quick Google search turned up boost::intrusive_ptr, which might interest you. It doesn't appear to support something like weak_ptr, but converting your program might be trivial, and then you would know for sure the cost of the non-intrusive ref counts.
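For reference, a minimal sketch of what adopting it looks like (Obj and its refs field are illustrative; boost::intrusive_ptr only requires the two hook functions below, which it finds by argument-dependent lookup):

    #include <boost/intrusive_ptr.hpp>

    struct Obj {
        int refs;
        Obj() : refs(0) {}   // the count lives inside the object itself
        // ... payload ...
    };

    // Hooks that boost::intrusive_ptr finds via ADL.
    inline void intrusive_ptr_add_ref(Obj* p) { ++p->refs; }
    inline void intrusive_ptr_release(Obj* p) { if (--p->refs == 0) delete p; }

    int main() {
        boost::intrusive_ptr<Obj> p(new Obj);  // count -> 1; no separate control block allocated
        boost::intrusive_ptr<Obj> q = p;       // count -> 2; copying touches only the object
    }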

Halidom answered 21/3, 2010 at 11:39 Comment(2)
Indeed, it's not possible to use a weak_ptr with an intrusive counter, as an intrusive counter is destroyed with the object... and so the weak_ptr has no way to check whether or not the object is valid without actually accessing it.Voleta
@Matthieu: If the dependency graph is known to be a single cycle, I think you can use each object's link (there must be only one) as a validity flag. For the purpose of destruction anyway. Traversing a random graph would require thread-local storage, but that's not impossible.Halidom

Continuing along the lines of @Mike_Dunlavey's answer:

First, obtain a time-based profile, using your favorite tool: VTune or PTU or OProf.

Then, obtain a cache miss profile. L1 cache misses, or L2 cache misses, or ...

I.e. the first profile associates a "time spent" with each program counter. The second associates a "number of cache misses" value with each program counter.

Note: I often "reduce" the data, summing it up by function, or (if I have the technology) by loop. Or by bins of, say, 64 bytes. Comparing individual program counters is often not useful, because the performance counters are fuzzy - the place where a cache miss gets reported is often several instructions away from where it actually happened.
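As a sketch of what that reduction looks like (the raw PCs are hypothetical; in practice they would come from your profiler's sample output):

    #include <cstdint>
    #include <map>
    #include <vector>

    // Bucket raw sample PCs into 64-byte bins so counter "fuzziness"
    // (skid of a few instructions) doesn't scatter the counts.
    std::map<std::uint64_t, std::uint64_t> bin_samples(const std::vector<std::uint64_t>& pcs) {
        std::map<std::uint64_t, std::uint64_t> bins;
        for (std::uint64_t pc : pcs)
            ++bins[pc & ~std::uint64_t(63)];   // round down to a 64-byte boundary
        return bins;
    }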

OK, so now graph these two profiles to compare them. Here are some graphs that I find useful:

"Iceberg" charts: X axis is PC, positive Y axis is time, negative Y access is cache misses. Look for places that go both up and down.

("Interleaved" charts are also useful: same idea, X axis is PC, plot both time and cache misseson Y axis, but with narrow vertical lines of different colors, typically red and blue. Places where is a lot of both time and cache misses spent will have finely interleaved red and blue lines, almost looking purple. This extends to L2 and L3 cache misses, all on the same graph. By the way, you probably want to "normalize" the numbers, either to %age of total time or cache misses, or, even better, %age of the maximum data point of time or cache misses. If you get the scale wrong, you won't see anything.)

XY charts: for each sampling bin (PC, or function, or loop, or...) plot a point whose X coordinate is the normalized time, and whose Y coordinate is the normalized cache misses. If you get a lot of data points in the upper right hand corner - large %age time AND large %age cache misses - that is interesting evidence. Or, forget number of points - if the sum of all percentages in the upper corner is big...
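A sketch of producing those XY points (assuming both profiles are keyed by the same bins as above, and normalizing to the maximum data point as suggested; the output is plain text you can feed to gnuplot or Excel):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <map>

    // Join a time profile and a miss profile and emit one
    // "time_fraction miss_fraction" pair per bin for an XY scatter plot.
    void emit_xy(const std::map<std::uint64_t, std::uint64_t>& time_prof,
                 const std::map<std::uint64_t, std::uint64_t>& miss_prof) {
        std::uint64_t max_t = 1, max_m = 1;
        for (const auto& kv : time_prof) max_t = std::max(max_t, kv.second);
        for (const auto& kv : miss_prof) max_m = std::max(max_m, kv.second);
        for (const auto& kv : time_prof) {
            auto it = miss_prof.find(kv.first);
            double miss = (it == miss_prof.end()) ? 0.0 : double(it->second) / max_m;
            std::printf("%f %f\n", double(kv.second) / max_t, miss);
        }
    }

Points that land in the upper right corner of the resulting plot are the bins worth staring at.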

Note, unfortunately, that you often have to roll these analyses yourself. Last I checked VTune does not do it for you. I have used gnuplot and Excel. (Warning: Excel dies above 64 thousand data points.)


More advice:

If your smart pointer is inlined, you may get the counts all over the place. In an ideal world you would be able to trace back PCs to the original line of source code. In this case, you may want to defer the reduction a bit: look at all individual PCs; map them back to lines of source code; and then map those into the original function. Many compilers, e.g. GCC, have symbol table options that allow you to do this.
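For example, GNU addr2line can walk a sampled PC back through the inline chain (the address is a placeholder; this assumes the binary was built with -g):

    addr2line -e yourExecutable -f -C -i 0x4007f0

With -i it prints every inlined frame at that address, innermost first, so a hot PC inside an inlined smart-pointer method can be attributed back to its enclosing function.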

By the way, I suspect that your problem is NOT the smart pointer causing cache thrashing - unless you are doing smart_ptr<int> all over the place. If you are doing smart_ptr<Obj>, and sizeof(Obj) is greater than, say, 4*sizeof(Obj*) (and the smart_ptr itself is not huge), then it is not that much.

More likely it is the extra level of indirection that the smart pointer adds that is causing your problem.

Coincidentally, I was talking to a guy at lunch who had a reference counted smart pointer that was using a handle, i.e. a level of indirection, something like

template<typename T> class refcntptr;  // forward declaration for the friend below

template<typename T> class refcnt_handle {
    T* ptr;
    int count;
    friend class refcntptr<T>;
};

template<typename T> class refcntptr {
    refcnt_handle<T>* handle;  // the extra level of indirection
public:
    refcntptr(T* obj) {
        this->handle = new refcnt_handle<T>();
        this->handle->ptr = obj;
        this->handle->count = 1;
    }
};

(I wouldn't code it this way, but it serves for exposition.)

The double indirection this->handle->ptr can be a big performance problem. Or even a triple indirection, this->handle->ptr->field. At the least, on a machine with 5 cycle L1 cache hits, each this->handle->ptr->field would take 10 cycles. And be much harder to overlap than a single pointer chase. But, worse, if each is an L1 cache miss, even if it were only 20 cycles to the L2... well, it is much harder to hide 2*20=40 cycles of cache miss latency, than a single L1 miss.

In general, it is good advice to avoid levels of indirection in smart pointers. Instead of pointing to a handle, that all smart pointers point to, which itself points to the object, you might make the smart pointer bigger by having it point to the object as well as the handle. (Which then is no longer what is commonly called a handle, but is more like an info object.)

E.g.

template<typename T> class refcntptr;  // forward declaration for the friend below

template<typename T> class refcnt_info {
    T* ptr; // perhaps not necessary, but useful.
    int count;
    friend class refcntptr<T>;
};

template<typename T> class refcntptr {
    refcnt_info<T>* info;
    T* ptr;  // direct pointer to the object: dereferences skip the info object
public:
    refcntptr(T* obj) {
        this->ptr = obj;
        this->info = new refcnt_info<T>();
        this->info->ptr = obj;
        this->info->count = 1;
    }
};

Anyway - a time profile is your best friend.


Oh, yeah - Intel EMON hardware can also tell you how many cycles you waited at a PC. That can distinguish a large number of L1 misses from a small number of L2 misses.

Olivine answered 28/4, 2012 at 1:48 Comment(4)
+ As long as we're on the subject, here's what I wish the CPU would do: When it enters a cache-wait (L1 or L2) it records the PC that triggered it. Then if an interrupt has been requested, it should allow it to be answered, either during or just after the cache-wait, and let that information be queried. (It's OK if it causes the wait to restart.) Then if software wants to get lots of those and sum them up, it can. Personally I recommend against the summing-by-region part, because all that does is smear out the precise locations that nail the problem.Gaitskell
Unfortunately, on many machines the PC (Ip in x86 land) of the instruction that caused the cache miss is not available. The PC (IP) of the instruction that is stalling retirement waiting for the cache miss to come back is available, but not the PC that it is waiting on. We don't even have, easy to hand, what the stalling instruction is waiting on - all it knows is that it is waiting on an input register. // Now, some more recent processors route the PC all the way to the memory units, to do stuff like STLF prediction. If I were still at Intel I'd have done this already.Olivine
AMD IBS (Instruction Based Sampling) is relevant here.Olivine
I looked that up - impressive. It amazes me how chip jockeys tune their stuff to ever higher performance (though they may be hitting a limit), while software developers go the other way. (Of course it's amusing how they always end up using matrix multiplication as the canonical speed test, when real mega-liners waste tons of time simply with not-really-necessary function calls.)Gaitskell

It depends on what OS and CPU you are using. E.g. for Mac OS X and x86 or ppc, Shark will do cache miss profiling. Ditto for Zoom on Linux.

Hardman answered 21/3, 2010 at 11:54 Comment(4)
Unfortunately, Shark isn't supported by Apple anymore, so it won't work for much longer as new OSX/iOS versions come out. Zoom is still actively developed/updated and got a major release just a few months ago.Quesada
@federal: indeed - it's sad that Shark has now lapsed into a state of neglect - Instruments just doesn't cut it for the kind of tasks that Shark excelled at. I've decided to stick with Snow Leopard and Xcode 3.2.6 for as long as I realistically can - maybe in a year or two Xcode 4.x and Instruments will be more useful.Hardman
UPDATE: Zoom (v3.0) now profiles on Mac OS X and Linux. Shark is officially gone.Quesada
Also Instruments has improved somewhat, but I think it still doesn't do cache miss profiling.Hardman

If you're running an AMD processor, you can get CodeAnalyst, apparently free as in beer.

Intone answered 21/3, 2010 at 12:21 Comment(0)

My advice would be to use PTU (Performance Tuning Utility) from Intel.

This utility is the direct descendant of VTune and provides the best sampling profiler available. You'll be able to track where the CPU is spending or wasting time (with the help of the available hardware events), and this with no slowdown of your application or perturbation of the profile. And of course you'll be able to gather all the cache line miss events you are looking for.

Motor answered 21/3, 2010 at 11:47 Comment(3)
Problem is, cache pollution will cause misses all over the place. What pattern is there to look for?Halidom
First thing you need to find out is: is there really a problem in your particular application? Profile your application as a user would use it, then check in the report where your bottlenecks are located. You might find a high number of L2 cache line misses, but they might be caused by other parts of your application, and you may discover other issues that you were not worried about. It does not mean that you do not have a problem with your smart pointers, but it may be hidden behind more pressing bottlenecks. Tell me how it goes - I'd be happy to help on any performance issue.Motor
@Halidom - noticed comment years later. WRT patterns, I look for frequency patterns. Eg. If you have data addresses, histogram by associative sets in L1/L2/TLB. Try a Fourier Transform to look for frequencies. Those are spatial - if you have time of miss, you can look for time and spatial frequencies.Olivine

Another tool for CPU performance counter-based profiling is oprofile. You can view its results using kcachegrind.
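A classic opcontrol session looks roughly like this (the event name is illustrative; run ophelp to see which events your CPU actually exposes):

    opcontrol --setup --no-vmlinux --event=LLC_MISSES:100000
    opcontrol --start
    ./yourExecutable
    opcontrol --dump && opcontrol --stop
    opreport -l ./yourExecutable     # per-symbol sample counts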

Wormeaten answered 21/3, 2010 at 12:28 Comment(0)

Here's kind of a general answer: pause the program at random under a debugger and look at where the program counter is sitting.

For example, if your program is spending, say, 50% of its time in cache misses, then 50% of the time when you pause it, the program counter will be at the exact locations where it is waiting for the memory fetches that are causing the cache misses.
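A no-tools way to try this is plain gdb (the executable name is a placeholder; any debugger that can interrupt and show the PC works):

    gdb ./yourExecutable
    (gdb) run
    ^C                  # interrupt at a random moment
    (gdb) x/i $pc       # the instruction it is sitting on
    (gdb) bt            # and the call chain that got it there
    (gdb) continue      # repeat ~10 times; locations that recur are your suspects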

Gaitskell answered 23/3, 2010 at 1:32 Comment(4)
However, spending 50% of the time with the program counter at retirement stalled pointing to a cache miss is NOT the same as saying that the program is spending 50% of its time in cache misses, or that the program would double in speed if all cache misses were removed. In a speculative machine a cache miss may not stall retirement, but may stall another instruction that itself stalls retirement.Olivine
@KrazyGlew: I'm sure you're right that the chip architects, in their laudable quest for speed, have made it less than easy to track down just which code is causing them headaches :)Gaitskell
@Mike_Dunlavey: mea culpa. Both as a CPU architect, and as the CPU architect responsible for making it possible to profile on cache misses (EMON profiling). But, I'm also a performance analyst, was a perf analyst before I became a CPU architect. If I knew of a cheap way to measure exactly how much time is being lost to cache misses, I would have built it. In the mean time, the best I can think of doing is to describe what is being measured.Olivine
@KrazyGlew: The last time I worked directly with chip-level guys (BCE) they wanted to help me out with measurement timers of one sort or another. I said thanks but no thanks. Measurement doesn't tell me what I need to know. What I need to know is, at a random time of my choosing, What Is It Waiting For Right Now (WIIWFRN :) See the difference? If I ask it that question, at 10 random times, and it tells me on 3 of those times it's in cache wait due to IP=foo, I know exactly what code needs to be fixed to pick up 30% (roughly). An ICE sufficed, back then.Gaitskell
