How to distinguish between a CPU and a memory bottleneck?
I want to filter some data (100s of MB to a few GBs). The user can change the filters so I have to re-filter the data frequently.

The code is rather simple:

std::vector<Event*> filteredEvents;
for (size_t i = 0; i < events.size(); i++) {
    const auto ev = events[i];

    for (const auto& filter : filters) {
        if (filter->evaluate(ev)) {
            filteredEvents.push_back(ev);
            break;
        }
    }

    if (i % 512 == 0) {
        updateProgress(i);
    }
}

I now want to add another filter. I can do that either in a way that uses more CPU or that uses more memory. To decide between the two I would like to know what the bottleneck of the above loop is.

How do I profile the code to decide whether the bottleneck is the CPU or memory?


In case it matters, the project is written in Qt and I use Qt Creator as the IDE. The platform is Windows. I am currently using Very Sleepy to profile my code.

Tiler answered 13/12, 2019 at 8:55 Comment(6)
The CPU can keep performance counters which can indicate whether the CPU is e.g. stalled waiting on memory or limited by the throughput of some port. But I don't know if Very Sleepy can work with those. Intel VTune is probably the go-to Windows tool for that, but it's far from free. See #34642144 for a few more options.Elf
It's entirely possible that the result is CPU-dependent. The code above also appears single-threaded, which might be another factor which affects the trade-off.Primogeniture
Some basic observations: 1) std::vector::push_back will resize (copy) when it gets full. This can hurt if filteredEvents gets large. 2) For every element you need to follow the Event pointer. If events are not located contiguously in memory that is (very probably) a cache miss every time. 3) I'm guessing filter::evaluate is a virtual function? If you have more than a few filter types your branch predictor will be very unhappy.Rudbeckia
I would replace `const auto ev = events[i];` with `const auto &ev`. You don't have to allocate new memory for this variable, unless you are working on the events from another thread, in which case you have no critical-section protection anyway.Gisellegish
Let's not get lost in the details here; the real code is slightly more complex (e.g. the vector reserves memory up front; I also doubt that references are better than copying a pointer). Changing the code requires that I know what I should optimize for. For that I need to figure out how to profile the code.Quadratics
If, for example, you expand a memory structure so that it no longer fits in a cache line where it did before, and performance suffers as a consequence, is that then a CPU or memory bottleneck?Rovelli
Intel VTune is a free graphical profiler available on Windows, Linux, and macOS. If you want something that is easy to set up and straightforward to read, and you're on an Intel CPU, VTune is a good choice. It can automatically suggest where your bottleneck is (core-bound vs. memory-bound).

Under the hood, I believe VTune collects a bunch of PMU (performance monitoring unit) counter values, LBR records, stack information, etc. On Linux, you can collect the same kind of performance stats yourself with the Linux perf tool. For example, perf record followed by perf report shows you the hotspots of your application. If you're interested in other metrics, such as cache-miss behavior, you have to tell perf explicitly which performance counters to collect; perf mem addresses some of that need. But after all, Linux perf is a lot more "hard core" than the graphical VTune: you need to know which counter values to look for to make good use of it. Sometimes one counter directly gives you the metric you want; other times you have to compute your metric from several counter values. Use perf list to appreciate in how much detail it can profile your machine and system's performance.
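For reference, the perf workflow described above looks roughly like this on Linux (`./myapp` is a placeholder for your binary; the generic event names below are broadly available, but exact event availability varies by CPU):

```shell
# Sample where time is spent, then browse the hotspots interactively
perf record -g ./myapp
perf report

# Count CPU vs. cache behavior explicitly
perf stat -e cycles,instructions,cache-references,cache-misses ./myapp

# Sample memory accesses (which loads/stores miss, and where)
perf mem record ./myapp
perf mem report

# List every event perf knows about on this machine
perf list
```

A low instructions-per-cycle figure together with a high cache-miss rate from perf stat is the usual first hint that the loop is memory-bound rather than compute-bound.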

Pitzer answered 13/12, 2022 at 3:25 Comment(1)
As for what to look for: cache miss rate and stalls due to memory, like the cycle_activity.stalls_l2_miss perf event. Also en.wikipedia.org/wiki/Roofline_model re: memory bandwidth vs. ALU throughput limits.Vaas