Is valgrind's cachegrind still the go-to tool in 2021?
C

I'm a long-time user of cachegrind for program profiling, and recently went back to check the official documentation once more: https://valgrind.org/docs/manual/cg-manual.html

In it, there are multiple references to CPU models, implementation decisions and simulation models that are all from the mid-2000s, and there are also statements that some behavior changed on "modern" processors:

the LL cache typically replicates all the entries of the L1 caches [...] This is standard on Pentium chips, but AMD Opterons, Athlons and Durons use an exclusive LL cache [...]

Cachegrind simulates branch predictors intended to be typical of mainstream desktop/server processors of around 2004.

More recent processors have better branch predictors [...] Cachegrind's predictor design is deliberately conservative so as to be representative of the large installed base of processors which pre-date widespread deployment of more sophisticated indirect branch predictors. In particular, late model Pentium 4s (Prescott), Pentium M, Core and Core 2 have more sophisticated indirect branch predictors than modelled by Cachegrind.

Now I'm wondering:

  • how many of these choices still apply in 2021 when developing on latest-gen CPUs,
  • whether the implementation of cachegrind has been updated to reflect latest CPUs, but the manual is outdated,
  • whether cachegrind shows skewed results on modern CPUs due to its simulation of legacy behavior.

Any insight is greatly appreciated!

Cachalot answered 26/6, 2021 at 2:48

Comments (2)
The biggest modern development in branch prediction is TAGE, which indexes based on the pattern of recent branches, and thus can "learn" complex patterns, like for a dispatch switch branch in an interpreter. See "Branch Prediction and the Performance of Interpreters - Don't Trust Folklore". Haswell and Zen2 both use IT-TAGE. That may or may not be relevant to the workload you're profiling. (Not posting an answer since I don't know much about cachegrind.) - Evie
Given how much µop throughput and microarchitectures have changed over the years, it will be at best a rough estimator, and at worst completely misleading. You are always better off doing actual measurements. This was already the case back in 2004. The repeatability and predictability of cachegrind is a lure: you are micro-optimizing for a CPU your users never had. - Gravity
