STL priority_queue compiled with GCC 9 has slower performance compared to GCC 5


Asked 7/10, 2022 at 12:10 Answered 7/10, 2022 at 13:5

c++ performance gcc compiler-optimization priority-queue


For my project I switched from GCC 5 to GCC 9 and found that performance got worse. After some investigation I came up with a short program that reproduces the behaviour.

I compile the code with two GCC versions (g++-5 and g++-9) on the same machine:

#include <queue>

int main()
{
        std::priority_queue<int> q;
        for (int j = 0; j < 2000; j ++) {
                for (int i = 0; i < 20000; i ++) {
                        q.emplace(i);
                }
                for (int i = 0; i < 20000; i ++) {
                        q.pop();
                }
        }
        return 0;
}

When I compile it using GCC 5 I get the following timings:

# g++-5 -std=c++14 -O3 main.cpp
# time ./a.out

real    0m1.580s
user    0m1.578s
sys     0m0.001s

Doing the same with GCC 9 I get:

# g++-9 -std=c++14 -O3 main.cpp
# time ./a.out

real    0m2.292s
user    0m2.288s
sys     0m0.003s

As you can see, the GCC 9 build is about 45% slower (2.292 s versus 1.580 s of wall-clock time).

I am not sure that the issue is in the STL priority_queue itself. I tried the boost priority_queue and got the same results.

Does anyone have a clue why this program is slower when built with GCC 9 than with GCC 5? Are there compiler flags I should be using? Thank you in advance!
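One way to narrow this down might be to time the loop in-process rather than with `time`, and to run the same push/pop pattern both through std::priority_queue and through std::push_heap / std::pop_heap on a plain std::vector, which are the operations the adaptor is specified to use. The following is only a sketch: the function names and the checksum parameter are hypothetical, and reading the top element before each pop is an addition that the original program does not make (it is there so the work has an observable result).

```cpp
#include <algorithm>
#include <chrono>
#include <queue>
#include <vector>

// Runs the push/pop pattern from the question on std::priority_queue and
// returns the elapsed time in seconds. `checksum` receives the sum of all
// popped elements (an addition to the original loop, see the lead-in).
double time_priority_queue(int rounds, int n, long long& checksum) {
    std::priority_queue<int> q;
    checksum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int j = 0; j < rounds; ++j) {
        for (int i = 0; i < n; ++i) {
            q.emplace(i);
        }
        for (int i = 0; i < n; ++i) {
            checksum += q.top();
            q.pop();
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Same pattern, but with the heap algorithms applied directly to a vector.
double time_raw_heap(int rounds, int n, long long& checksum) {
    std::vector<int> v;
    checksum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int j = 0; j < rounds; ++j) {
        for (int i = 0; i < n; ++i) {
            v.push_back(i);
            std::push_heap(v.begin(), v.end());
        }
        for (int i = 0; i < n; ++i) {
            checksum += v.front();
            std::pop_heap(v.begin(), v.end());
            v.pop_back();
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling both functions with rounds = 2000 and n = 20000 corresponds to the workload in the question. If the two timings move together across compiler versions, the adaptor itself is probably not the culprit and the heap algorithms (or their inlining) are the more likely place to look.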

Kosse asked 7/10, 2022 at 12:10 Comment(6)

It would be useful if you could do some manual binary search to narrow it down to the precise version of gcc that introduced the performance regression. GCC 5 to 9 is a pretty big jump of over half a decade. – Confluent 7/10, 2022 at 12:17

Please also update your question with the exact version numbers (g++ --version). – Confluent 7/10, 2022 at 12:20

GCC 9 is a bit old. Have you tried with the latest release? – Digged 7/10, 2022 at 12:25

and please compare the output assembly – Maxey 7/10, 2022 at 13:2

Looking at the assembly, I notice that GCC-9 does not inline a call to std::__adjust_heap whereas GCC-5 does not inline std::vector::_M_emplace_back_aux. Why they chose to do that with a single call-site in both cases is beyond me, but I guess it could just be a tweak in the tuning options. – Castled 7/10, 2022 at 13:57
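To get a feel for how much an outlined sift-down call can matter on its own, a hand-written heap with the inlining decision pinned by GCC's always_inline and noinline function attributes could be timed both ways. This is a rough sketch under stated assumptions: every name below is hypothetical, the sift-down is a textbook version and not libstdc++'s std::__adjust_heap, and it does not change how the compiler treats the library's own code.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

#if defined(__GNUC__)
#define SKETCH_ALWAYS_INLINE inline __attribute__((always_inline))
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_ALWAYS_INLINE inline
#define SKETCH_NOINLINE
#endif

// Textbook sift-down: restores the max-heap property starting at `hole`.
static SKETCH_ALWAYS_INLINE void sift_down_impl(std::vector<int>& h, std::size_t hole) {
    const std::size_t n = h.size();
    for (;;) {
        std::size_t child = 2 * hole + 1;
        if (child >= n) break;
        if (child + 1 < n && h[child] < h[child + 1]) ++child;  // pick larger child
        if (!(h[hole] < h[child])) break;
        std::swap(h[hole], h[child]);
        hole = child;
    }
}

// Same work, but forced to stay behind a real function call.
static SKETCH_NOINLINE void sift_down_outlined(std::vector<int>& h, std::size_t hole) {
    sift_down_impl(h, hole);
}

// Inserts `value` and sifts it up towards the root.
void push_value(std::vector<int>& h, int value) {
    h.push_back(value);
    std::size_t i = h.size() - 1;
    while (i > 0) {
        const std::size_t parent = (i - 1) / 2;
        if (!(h[parent] < h[i])) break;
        std::swap(h[parent], h[i]);
        i = parent;
    }
}

// Removes and returns the maximum; precondition: `h` is not empty.
int pop_max_inlined(std::vector<int>& h) {
    const int top = h.front();
    h.front() = h.back();
    h.pop_back();
    sift_down_impl(h, 0);
    return top;
}

// As above, but the sift-down goes through the noinline wrapper.
int pop_max_outlined(std::vector<int>& h) {
    const int top = h.front();
    h.front() = h.back();
    h.pop_back();
    sift_down_outlined(h, 0);
    return top;
}
```

Timing the question's loop once with pop_max_inlined and once with pop_max_outlined, on the same compiler, would give an estimate of what the call overhead alone costs on a given machine.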

What CPU do you have? If it's a Skylake, does "How can I mitigate the impact of the Intel jcc erratum on gcc?" help? If so, it might just be random chance that GCC5 was fast and GCC9 was slow, separate from any missed-optimizations like poor inlining decisions. – Bolanger 7/10, 2022 at 16:7


This is not meant to be an answer, but since I have a few g++ toolchains available I made a few test runs to see whether anything interesting shows up regarding this perceived degradation.

The biggest slowdown seems to be between 6.2 and 7.2. Perhaps this table will help someone recall what the cause may be.

I used C++11 since I started with gcc 4, so in all cases except the first one, I used g++ -std=c++11 -O3 main.cpp.

g++ version	real	user	sys
4.5.0 (-std=c++0x)	0m1.711s	0m1.701s	0m0.004s
4.8.5	0m1.673s	0m1.667s	0m0.002s
5.1.0	0m1.586s	0m1.578s	0m0.002s
6.2.0	0m1.775s	0m1.766s	0m0.003s
7.2.0	0m2.192s	0m2.176s	0m0.003s
8.2.0	0m2.192s	0m2.186s	0m0.000s
9.3.0	0m2.122s	0m2.114s	0m0.001s
10.2.0	0m2.308s	0m2.299s	0m0.002s
11.3.0	0m2.293s	0m2.285s	0m0.002s
12.1.0	0m2.306s	0m2.299s	0m0.001s
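The size of each step is easier to see as a percentage. As a small sketch (the struct and function names are hypothetical; the numbers are the user times from the table above), the relative change against any chosen baseline can be computed like this:

```cpp
#include <cstdio>
#include <vector>

// One row of the table: compiler version and measured user time in seconds.
struct Run {
    const char* version;
    double user_seconds;
};

// User times copied from the table above.
std::vector<Run> table_runs() {
    return {
        {"4.5.0", 1.701}, {"4.8.5", 1.667}, {"5.1.0", 1.578},
        {"6.2.0", 1.766}, {"7.2.0", 2.176}, {"8.2.0", 2.186},
        {"9.3.0", 2.114}, {"10.2.0", 2.299}, {"11.3.0", 2.285},
        {"12.1.0", 2.299},
    };
}

// Percentage by which time `t` exceeds `baseline` (negative means faster).
double percent_slower(double baseline, double t) {
    return (t / baseline - 1.0) * 100.0;
}

// Prints every run together with its change relative to `baseline`.
void print_relative_to(const std::vector<Run>& runs, double baseline) {
    for (const Run& r : runs) {
        std::printf("%-8s %6.3f s  %+6.1f%%\n",
                    r.version, r.user_seconds,
                    percent_slower(baseline, r.user_seconds));
    }
}
```

With these figures, 7.2.0 comes out roughly 23% slower than 6.2.0 and roughly 38% slower than 5.1.0 on user time, while 6.2.0 is already about 12% slower than 5.1.0.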

Unblushing answered 7/10, 2022 at 13:5 Comment(9)

Could you try setting a specific -march option? I believe the default tuning changed. Maybe pick something that should be present in all versions like -march=nehalem – Castled 7/10, 2022 at 13:47

@Castled I tried -march=nehalem with a few toolchain versions (those with the biggest diffs) but the results were pretty consistent. Perhaps I should mention the CPU? It's reported as an Intel(R) Xeon(R) Gold 6242R CPU @ 3.10GHz – Unblushing 7/10, 2022 at 13:53

BTW, in case you were considering -march=native, that won't work well. On a GCC too old to know about -march=skylake-avx512, it will still enable the ISA extension options it knows about, but you won't get a -mtune=something-recent; it just gives up and uses -mtune=generic if you use a GCC too old to know about your CPU specifically. So -march=nehalem to imply -mtune=nehalem is a reasonable choice. – Bolanger 7/10, 2022 at 15:55

Of course your CPU isn't a Nehalem... It is a Skylake, where microcode updates have introduced a few performance pot-holes. One that needs compilers to work around it, if a tight loop happens to step in it: "How can I mitigate the impact of the Intel jcc erratum on gcc?" – Bolanger 7/10, 2022 at 15:57

@PeterCordes Re: "in case you were considering" - guilty. I tried. Could we set something up that can give us some insight? I'm willing to re-test properly. – Unblushing 7/10, 2022 at 23:6

-march=nehalem is probably fine. -mtune=sandybridge or -mtune=corei7-avx might work, at least for the GCCs new enough to know them. Also use -Wa,-mbranches-within-32B-boundaries to mitigate the problem caused by the microcode workaround for the JCC erratum; that's always a prime suspect for micro-benchmarks on SKL/SKX, esp. if front-end throughput is a problem. But really the best bet is to figure out what asm (or machine-code alignment) difference was causing the big change, and then work from there to see which GCC options or versions help or not with it. – Bolanger 8/10, 2022 at 4:6

@PeterCordes "the best bet is to figure out what asm (or machine-code alignment) difference was causing the big change" - I will try to build as good a matrix as I can when I'm back at the store. For our particular needs I think we're not going to change just now, but it's always nice to keep an eye out for options. I'm also not able to say "what's what" in assembly. – Unblushing 8/10, 2022 at 4:37

Oh, if you mean for production use, -march=native with a recent GCC version is supposed to be good; that's what -march=native is intended for. The reason not to use it for this test is that we want to try ancient GCC versions quite a bit older than your CPU, which will fall back to -mtune=generic if they don't support -march=skylake-avx512. I would actually strongly recommend against -mtune=sandybridge for general use on a Skylake in cases that include auto-vectorization. (See "Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?") – Bolanger 8/10, 2022 at 4:41

@PeterCordes Ok, let's see if I can keep up. I did at first just do g++ -std=c++11 -O3. Then I tried -march=nehalem on select versions. I did try -march=native too even though I didn't mention it. I didn't actually see any diff worth mentioning. What kind of matrix is worth building here? I am absolutely not the guy who decides, but I can try things out given instructions. – Unblushing 8/10, 2022 at 4:45
