I'm a long-time user of cachegrind for program profiling, and recently went back to check the official documentation once more: https://valgrind.org/docs/manual/cg-manual.html
In it, there are multiple references to CPU models, implementation decisions, and simulation models that all date from the mid-2000s, along with statements that some behavior has changed on "modern" processors:
"the LL cache typically replicates all the entries of the L1 caches [...] This is standard on Pentium chips, but AMD Opterons, Athlons and Durons use an exclusive LL cache [...]"
"Cachegrind simulates branch predictors intended to be typical of mainstream desktop/server processors of around 2004."
"More recent processors have better branch predictors [...] Cachegrind's predictor design is deliberately conservative so as to be representative of the large installed base of processors which pre-date widespread deployment of more sophisticated indirect branch predictors. In particular, late model Pentium 4s (Prescott), Pentium M, Core and Core 2 have more sophisticated indirect branch predictors than modelled by Cachegrind."
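To make the concern about indirect branches concrete, here is a minimal sketch of a last-target indirect branch predictor. It assumes a table of 512 entries indexed by the low-order bits of the branch address, where each branch is predicted to jump to the same target as last time, which is how I understand the manual's description. This is an illustration, not Cachegrind's actual code; the class name, the dispatch address, and the handler addresses are all hypothetical.

```python
import random


class LastTargetPredictor:
    """Toy indirect-branch predictor: predict the same target as last time."""

    def __init__(self, entries=512):
        # Assumption: table size taken from my reading of the manual.
        self.entries = entries
        self.table = [None] * entries
        self.branches = 0
        self.mispredicts = 0

    def branch(self, branch_addr, actual_target):
        # Index the table with the low-order bits of the branch address.
        index = branch_addr % self.entries
        self.branches += 1
        if self.table[index] != actual_target:
            self.mispredicts += 1
        # Remember the target for the next prediction.
        self.table[index] = actual_target

    def miss_rate(self):
        return self.mispredicts / self.branches


def simulate_dispatch(opcodes, dispatch_addr=0x401000):
    """Feed one interpreter-style dispatch branch through the predictor."""
    predictor = LastTargetPredictor()
    for op in opcodes:
        # Hypothetical handler addresses: one distinct target per opcode.
        predictor.branch(dispatch_addr, 0x500000 + 64 * op)
    return predictor.miss_rate()


rng = random.Random(0)
repetitive = [3] * 10_000            # same opcode every time
cyclic = [0, 1, 2, 3] * 2_500        # perfectly regular, but never repeats
mixed = [rng.randrange(8) for _ in range(10_000)]  # 8 opcodes, uniform

for name, stream in [("repetitive", repetitive), ("cyclic", cyclic), ("mixed", mixed)]:
    print(f"{name}: miss rate {simulate_dispatch(stream):.4f}")
```

In this toy model the cyclic stream mispredicts on every dispatch even though it is perfectly regular, because the target is never the same as the previous one. A predictor that tracks branch history could in principle learn that pattern, which is the kind of gap I am asking about.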
Now I'm wondering:
- how many of these choices still apply in 2021 when developing on latest-gen CPUs,
- whether the implementation of cachegrind has been updated to reflect latest CPUs, but the manual is outdated,
- whether cachegrind shows skewed results on modern CPUs due to its simulation of legacy behavior.
Any insight is greatly appreciated!
switch branch in an interpreter. Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore. Haswell and Zen 2 both use IT-TAGE. That may or may not be relevant to the workload you're profiling. (Not posting an answer since I don't know much about cachegrind.) – Evie