STL priority_queue compiled with GCC 9 is slower than with GCC 5
For my project I switched from GCC 5 to GCC 9 and found that performance got worse. After some investigation I came up with a simple program that reproduces the behaviour.

I compile the code with different GCC versions (g++-5 and g++-9) on the same machine:

#include <queue>

int main()
{
        std::priority_queue<int> q;
        for (int j = 0; j < 2000; j ++) {
                for (int i = 0; i < 20000; i ++) {
                        q.emplace(i);
                }
                for (int i = 0; i < 20000; i ++) {
                        q.pop();
                }
        }
        return 0;
}

When I compile it using GCC 5 I get the following timings:

# g++-5 -std=c++14 -O3 main.cpp
# time ./a.out

real    0m1.580s
user    0m1.578s
sys     0m0.001s

Doing the same with GCC 9 I get:

# g++-9 -std=c++14 -O3 main.cpp
# time ./a.out

real    0m2.292s
user    0m2.288s
sys     0m0.003s

As you can see, the GCC 9 build is about 45% slower (2.29 s versus 1.58 s).

I am not sure that the issue is in the STL priority_queue itself: I tried the Boost priority_queue and got the same results.
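One way to narrow this down is to time the two halves of the loop separately. The following is only a sketch: it runs the same fill/drain pattern with std::chrono::steady_clock around the emplace and pop phases. The names PhaseTimes and time_phases are made up for this illustration and are not part of the original test program.

```cpp
#include <chrono>
#include <cstddef>
#include <queue>

// Accumulated wall-clock time per phase, in seconds.
struct PhaseTimes {
    double emplace_s = 0.0;
    double pop_s = 0.0;
    std::size_t final_size = 0;  // expected to be 0 after a balanced run
};

// Same fill/drain pattern as the test program, with each phase timed on its own.
// `rounds` and `items` are illustrative parameter names (the question uses 2000 and 20000).
PhaseTimes time_phases(int rounds, int items)
{
    using clock = std::chrono::steady_clock;
    std::priority_queue<int> q;
    PhaseTimes t;
    for (int j = 0; j < rounds; j++) {
        auto t0 = clock::now();
        for (int i = 0; i < items; i++) {
            q.emplace(i);
        }
        auto t1 = clock::now();
        for (int i = 0; i < items; i++) {
            q.pop();
        }
        auto t2 = clock::now();
        t.emplace_s += std::chrono::duration<double>(t1 - t0).count();
        t.pop_s += std::chrono::duration<double>(t2 - t1).count();
    }
    t.final_size = q.size();
    return t;
}
```

Calling time_phases(2000, 20000) from main and printing both totals under each compiler should show whether the extra time is spent in emplace, in pop, or in both. The clock calls add a little overhead, so the totals will not match the output of time ./a.out exactly.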

Does anyone have a clue why this program is slower with GCC 9 than with GCC 5? Should I be using some compiler flags? Thank you in advance!

Kosse answered 7/10, 2022 at 12:10 Comment(6)
It would be useful if you could do a manual binary search to narrow this down to the precise version of GCC that introduced the performance regression. GCC 5 to 9 is a pretty big jump of over half a decade. - Confluent
Please also update your question with the exact version numbers (g++ --version). - Confluent
GCC 9 is a bit old. Have you tried with the latest release? - Digged
And please compare the output assembly. - Maxey
Looking at the assembly, I notice that GCC 9 does not inline a call to std::__adjust_heap, whereas GCC 5 does not inline std::vector::_M_emplace_back_aux. Why they chose to do that with a single call site in both cases is beyond me, but I guess it could just be a tweak in the tuning options. - Castled
What CPU do you have? If it's a Skylake, does "How can I mitigate the impact of the Intel jcc erratum on gcc?" help? If so, it might just be random chance that GCC 5 was fast and GCC 9 was slow, separate from any missed optimizations like poor inlining decisions. - Bolanger
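For context on the functions named in the inlining comment: the standard specifies priority_queue::emplace as emplace_back on the underlying container followed by std::push_heap, and pop as std::pop_heap followed by pop_back. In libstdc++, as far as I can tell, std::pop_heap is where std::__adjust_heap ends up being called, and _M_emplace_back_aux was the reallocating slow path of vector::emplace_back in older releases. Below is a minimal sketch of the same structure built by hand (VectorHeap is a hypothetical name), which could be swapped into the test program to check whether the priority_queue wrapper itself matters.

```cpp
#include <algorithm>
#include <vector>

// Hand-rolled max-heap using the building blocks that std::priority_queue
// is specified in terms of. Illustration only, not a drop-in replacement.
struct VectorHeap {
    std::vector<int> c;

    void emplace(int v)
    {
        c.emplace_back(v);                   // may reallocate when capacity runs out
        std::push_heap(c.begin(), c.end());  // sift the new element up
    }

    void pop()
    {
        std::pop_heap(c.begin(), c.end());   // move the max to the back, sift the rest down
        c.pop_back();
    }

    int top() const { return c.front(); }
    bool empty() const { return c.empty(); }
};
```

If this version shows the same GCC 5 versus GCC 9 gap, the difference most likely lies in how the heap algorithms are compiled rather than in priority_queue, which would also be consistent with the Boost result in the question.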

This is not meant to be an answer but since I have a few g++ toolchains available I made a few test runs to see if I could see something interesting regarding this perceived degradation.

The biggest slowdown seems to be between 6.2 and 7.2. Perhaps this table can trigger someone to recall what may be the cause.

I used C++11 since I started with gcc 4, so in all cases except the first one, I used g++ -std=c++11 -O3 main.cpp.

g++ version          real      user      sys
4.5.0 (-std=c++0x)   0m1.711s  0m1.701s  0m0.004s
4.8.5                0m1.673s  0m1.667s  0m0.002s
5.1.0                0m1.586s  0m1.578s  0m0.002s
6.2.0                0m1.775s  0m1.766s  0m0.003s
7.2.0                0m2.192s  0m2.176s  0m0.003s
8.2.0                0m2.192s  0m2.186s  0m0.000s
9.3.0                0m2.122s  0m2.114s  0m0.001s
10.2.0               0m2.308s  0m2.299s  0m0.002s
11.3.0               0m2.293s  0m2.285s  0m0.002s
12.1.0               0m2.306s  0m2.299s  0m0.001s
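To make the "biggest slowdown" claim easy to check, here is a small sketch that takes the user times from the table and finds the version with the largest relative increase over the version tested just before it. The names Row, measured_rows and largest_step are invented for this illustration; the numbers are copied from the table and nothing else is measured.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Row {
    std::string version;
    double user_s;  // "user" column, in seconds
};

// User times copied from the table, in the order tested.
std::vector<Row> measured_rows()
{
    return {
        {"4.5.0", 1.701}, {"4.8.5", 1.667}, {"5.1.0", 1.578}, {"6.2.0", 1.766},
        {"7.2.0", 2.176}, {"8.2.0", 2.186}, {"9.3.0", 2.114}, {"10.2.0", 2.299},
        {"11.3.0", 2.285}, {"12.1.0", 2.299},
    };
}

// Index of the row whose user time grew most relative to the previous row.
// Assumes rows.size() >= 2.
std::size_t largest_step(const std::vector<Row>& rows)
{
    std::size_t best = 1;
    for (std::size_t i = 2; i < rows.size(); i++) {
        double step = rows[i].user_s / rows[i - 1].user_s;
        double best_step = rows[best].user_s / rows[best - 1].user_s;
        if (step > best_step) {
            best = i;
        }
    }
    return best;
}
```

With these numbers the largest step lands on 7.2.0, roughly 23% slower than 6.2.0. The step from 5.1.0 to 6.2.0 is about 12%.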
Unblushing answered 7/10, 2022 at 13:5 Comment(9)
Could you try setting a specific -march option? I believe the default tuning changed. Maybe pick something that should be present in all versions, like -march=nehalem. - Castled
@Castled I tried -march=nehalem with a few toolchain versions (those with the biggest diffs) but the results were pretty consistent. Perhaps I should mention the CPU? It's reported as an Intel(R) Xeon(R) Gold 6242R CPU @ 3.10GHz. - Unblushing
BTW, in case you were considering -march=native, that won't work well. A GCC too old to know about -march=skylake-avx512 will still enable the ISA extension options it knows about, but you won't get a recent -mtune setting: it just gives up and uses -mtune=generic when it doesn't know your CPU specifically. So -march=nehalem, which implies -mtune=nehalem, is a reasonable choice. - Bolanger
Of course your CPU isn't a Nehalem... It is a Skylake, where microcode updates have introduced a few performance pot-holes. One of them needs compilers to work around it if a tight loop happens to step in it: "How can I mitigate the impact of the Intel jcc erratum on gcc?" - Bolanger
@PeterCordes Re: "in case you were considering" - guilty. I tried. Could we set something up that can give us some insight? I'm willing to re-test properly. - Unblushing
-march=nehalem is probably fine. -mtune=sandybridge or -mtune=corei7-avx might work, at least for the GCCs new enough to know them. Also use -Wa,-mbranches-within-32B-boundaries to mitigate the problem caused by the microcode workaround for the JCC erratum; that's always a prime suspect for micro-benchmarks on SKL/SKX, especially if front-end throughput is a problem. But really the best bet is to figure out what asm (or machine-code alignment) difference was causing the big change, and then work from there to see which GCC options or versions help or not with it. - Bolanger
@PeterCordes "the best bet is to figure out what asm (or machine-code alignment) difference was causing the big change" - I will try to build as good a matrix as I can when I'm back at the store. For our particular needs I think we're not going to change just now, but it's always nice to keep an eye out for options. I'm also not able to say "what's what" in assembly. - Unblushing
Oh, if you mean for production use, -march=native with a recent GCC version is supposed to be good; that's what -march=native is intended for. The reason not to use it for this test is that we want to try ancient GCC versions quite a bit older than your CPU, which will fall back to -mtune=generic if they don't support -march=skylake-avx512. I would actually strongly recommend against -mtune=sandybridge for general use on a Skylake in cases that include auto-vectorization. (See "Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?") - Bolanger
@PeterCordes Ok, let's see if I can keep up. At first I just did g++ -std=c++11 -O3. Then I tried -march=nehalem on select versions. I did try -march=native too, even though I didn't mention it. I didn't actually see any diff worth mentioning. What kind of matrix is worth building here? I am absolutely not the guy who decides, but I can try things out given instructions. - Unblushing
