Why is C++ executable running so much faster when linked against newer libstdc++.so?

I have a project (code here) in which I run benchmarks to compare the performance of different methods of computing a dot product (naive implementation, the Eigen library, SIMD implementations, etc.). I am testing on a fresh CentOS 7.6 VM. I have noticed that when I use different versions of libstdc++.so.6, I get significantly different performance.

When I spin up a new CentOS 7.6 instance, the default C++ standard library is libstdc++.so.6.0.19. When I run my benchmark executable (linked against this version of libstdc++), the output is as follows:

Naive Implementation, 1000000 iterations: 1448.74 ns average time
Optimized Implementation, 1000000 iterations: 1094.2 ns average time
AVX2 implementation, 1000000 iterations: 1069.57 ns average time
Eigen Implementation, 1000000 iterations: 1027.21 ns average time
AVX & FMA implementation 1, 1000000 iterations: 1028.68 ns average time
AVX & FMA implementation 2, 1000000 iterations: 1021.26 ns average time

If I download libstdc++.so.6.0.26 and change the symbolic link libstdc++.so.6 to point to this newer library and rerun the executable (without recompiling or changing anything else), the results are as follows:

Naive Implementation, 1000000 iterations: 297.981 ns average time
Optimized Implementation, 1000000 iterations: 156.649 ns average time
AVX2 implementation, 1000000 iterations: 131.577 ns average time
Eigen Implementation, 1000000 iterations: 92.9909 ns average time
AVX & FMA implementation 1, 1000000 iterations: 78.136 ns average time
AVX & FMA implementation 2, 1000000 iterations: 80.0832 ns average time

Why is there such a significant improvement in speed (some implementations are 10x faster)?

Due to my use case, I may be required to link against libstdc++.so.6.0.19. Is there anything I can do in my code / on my side to see these speed improvements while using the older version of libstdc++?

Edit: I created a minimal reproducible example.

main.cpp

#include <iostream>
#include <vector>
#include <cstring>
#include <cstdlib>   // rand, srand
#include <ctime>     // time
#include <chrono>
#include <cmath>
#include <stdexcept> // std::logic_error

typedef std::chrono::high_resolution_clock Clock;

const size_t SIZE_FLOAT = 512;

double computeDotProductOptomized(const std::vector<uint8_t>& v1, const std::vector<uint8_t>& v2);
void generateNormalizedData(std::vector<uint8_t>& v);

int main() {
    // Seed the random number generator
    srand(time(nullptr));

    std::vector<uint8_t> v1;
    std::vector<uint8_t> v2;

    generateNormalizedData(v1);
    generateNormalizedData(v2);

    const size_t numIterations = 10000000;
    double totalTime = 0.0;

    for (size_t i = 0; i < numIterations; ++i) {
        auto t1 = Clock::now(); 
        auto similarity = computeDotProductOptomized(v1, v2);
        auto t2 = Clock::now();

        totalTime += std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    }

    std::cout << "Average Time Taken: " << totalTime / numIterations << '\n';

    return 0;
}

double computeDotProductOptomized(const std::vector<uint8_t>& v1, const std::vector<uint8_t>& v2) {
    const auto *x = reinterpret_cast<const float*>(v1.data());
    const auto *y = reinterpret_cast<const float*>(v2.data());

    double similarity = 0;

    for (size_t i = 0; i < SIZE_FLOAT; ++i) {
        similarity += *(x + i) * *(y + i);
    }

    return similarity;
}

void generateNormalizedData(std::vector<uint8_t>& v) {
    std::vector<float> vFloat(SIZE_FLOAT);
    v.resize(SIZE_FLOAT * sizeof(float));

    for(float & i : vFloat) {
        i = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
    }

    // Normalize the vector
    float mod = 0.0;

    for (float i : vFloat) {
        mod += i * i;
    }

    float mag = std::sqrt(mod);

    if (mag == 0) {
        throw std::logic_error("The input vector is a zero vector");
    }

    for (float & i : vFloat) {
        i /= mag;
    }

    memcpy(v.data(), vFloat.data(), v.size());
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.14)
project(dot-prod-benchmark-min-reproducible-example C CXX)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC -Ofast -ffast-math -march=broadwell")
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_CXX_STANDARD 14)

add_executable(benchmark main.cpp)

Compiled on centos-release-7-6.1810.2.el7.centos.x86_64 using cmake version 3.16.2 and gcc (GCC) 7.3.1 20180303, on an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz with 4 vCPUs.

Using libstdc++.so.6.0.19: Average Time Taken: 1279.41 ns
Using libstdc++.so.6.0.26: Average Time Taken: 168.219 ns

Asked by Uphemia on 2/1/2020 at 22:20

Comments:
Maybe one of them has a broken implementation of the timing facilities? Did you try to time them by some other means? Did you profile the slow version to see what takes the time? – Kathrynkathryne
Can you make a minimal reproducible example for us that exhibits this behavior? Since all values change, you can just use the naive implementation. Then we could test this on our machines. – Fagaceous
@MarcGlisse I don't think the timing implementation is broken, as I notice it takes longer to run the executable. As for profiling, that is a good idea. I don't have much experience doing so, though; are there any good tools or resources you can point me to? – Uphemia
@Fagaceous You can clone the repo if you like; I have included build instructions and it should run as is (and requires no other dependencies). Otherwise, I will add a minimal reproducible example shortly. – Uphemia
Even the naive implementation improved significantly; can you post that code? – Upcountry
How were the different libstdc++ binaries compiled? Were they compiled with the same compiler and compilation options? – Dufrene
Is the old libstdc++ forcing a syscall, whereas the new one uses the vDSO? I'd only expect that of libc differences, but try running it under strace to be sure... – Peder
Does your code actually work in both cases? Did you check the result? – Designate
@Designate If you have a look at the project I have linked, you will see that before I run the benchmarks I ensure all the implementations are computing the dot product correctly. – Uphemia
Did you actually profile both runs (à la Valgrind or similar) to see where the code is spending all its time? Also, it would help if you talked about what type of CPU you have and whether you are on a 32-bit or 64-bit OS. Oh wait, I see your GitHub mentions: "Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz, 16GB RAM, using gcc 9". – Cheapjack
@Cheapjack For the above question I am using a CentOS VM with the following specs: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, 4 vCPUs, centos-release-7-6.1810.2.el7.centos.x86_64. – Uphemia
@Upcountry I added sample code. – Uphemia
@BaummitAugen The .19 version of the library comes with the operating system, so I am not sure how it was compiled and with what options. As for the .26 version of the library, I built it using this tutorial to build gcc 9.1; libstdc++.so.6.0.26 was installed when I built gcc 9.1. Note that I had gcc 7.3.1 installed at the time of building gcc 9.1. – Uphemia
I would time all iterations together, not each one separately. Chrono::now() can be fast or slow, depending on the implementation. – Designate
@Cheapjack No, I did not profile to see where it is spending all the time, as I am not super familiar with profiling. How can I do so using Valgrind or other tools? – Uphemia
@Designate Even if there is an issue with chrono::now(), I am running the dot product 10M times, so I can notice the difference in execution time between the two versions. One takes significantly less time to run than the other. I am therefore confident that one version is running faster than the other. – Uphemia
Did you try to profile your code, or are you asking others to profile for you? – Isoleucine
Try taking the chrono::now() calls out of the loop so you call them twice instead of 20M times. – Designate
@PavelP I honestly didn't think to profile, but I will do so now. – Uphemia
@Designate Thank you, that fixed the problem! Looks like it was the call to Clock::now() that was taking longer. – Uphemia
I think you are running into the issue explained here: gcc.1065356.n8.nabble.com/… Without knowing the glibc version against which libstdc++ was compiled, and without the configuration flags used, it is hard to tell, though. As mentioned in a comment above, try running your program with strace and watch for differences in syscalls. – Oasis

rustyx was correct. It was the use of auto t1 = Clock::now(); inside the loop that was causing the poor performance. Once I moved the timing outside the loop (timing the total time taken), the two library versions run equally fast:

    const size_t numIterations = 10000000;
    auto t1 = Clock::now(); 

    for (size_t i = 0; i < numIterations; ++i) {
        auto similarity = computeDotProductOptomized(v1, v2);
    }

    auto t2 = Clock::now();

    std::cout << "Total Time Taken: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
Answered by Uphemia on 2/1/2020 at 23:28

Your old libstdc++.so comes from GCC 4.8, and in that version the Clock::now() calls make direct system calls to the kernel to get the current time.

That is much slower than using the clock_gettime function in libc, which gets the result from the kernel's vDSO library instead of making a system call. That is what the newer libstdc++.so is doing.

Unfortunately GCC 4.8.x was released before Glibc made the clock_gettime function available without linking to librt.so and so the libstdc++.so in CentOS 7 doesn't know it could use the clock_gettime in Glibc instead of a direct system call. There's a configure option that can be used when building GCC 4.8.x that tells it to look for the function in libc.so, but the CentOS 7 compiler isn't built with that option enabled. I don't think there's any way to fix that without using a different libstdc++.so library.
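
If you do end up stuck with the system libstdc++.so.6.0.19, one possible workaround for the timing overhead specifically (a sketch, not something from this answer, assuming your glibc already exports clock_gettime from libc.so, which CentOS 7's glibc 2.17 should) is to bypass std::chrono for the timestamps and call clock_gettime yourself. The monotonicNanos helper below is hypothetical, not part of any library:

#include <time.h>   // clock_gettime, CLOCK_MONOTONIC (POSIX)
#include <cstdint>

// Hypothetical helper: read a monotonic timestamp in nanoseconds directly
// through glibc's clock_gettime, which is served from the vDSO where the
// kernel provides it, instead of going through the old libstdc++ chrono
// implementation. On glibc older than 2.17 this needs -lrt at link time.
inline std::int64_t monotonicNanos() {
    timespec ts{};
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Usage sketch: replace the Clock::now() pair with
//   auto t1 = monotonicNanos();
//   ... work ...
//   auto t2 = monotonicNanos();   // elapsed nanoseconds = t2 - t1

Whether this actually avoids the syscall on a given machine is worth confirming with strace, as suggested in the comments.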

Answered by Emmalynn on 21/1/2020 at 12:39
