Byte alignment and false sharing causing a performance difference on x86-64 [duplicate]
Environment: x86-64; Linux (CentOS); 8 CPU cores.
To test false-sharing performance, I wrote C++ code like this:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <thread>
using namespace std;

volatile int32_t a;
volatile int32_t b;
int64_t p1[7];   // intended as padding so c lands on its own 64-byte cache line
volatile int64_t c;
int64_t p2[7];   // intended as padding so d lands on its own 64-byte cache line
volatile int64_t d;

void thread1(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        a = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 1 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}

void thread2(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        b = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 2 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}

void thread3(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        c = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 3 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}

void thread4(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        d = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 4 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}

// main as described in the comments below: print the addresses,
// start all four threads, and join them.
int main() {
    cout << "a addr " << (void*)&a << " b addr " << (void*)&b
         << " c addr " << (void*)&c << " d addr " << (void*)&d << endl;
    thread t1(thread1, 1), t2(thread2, 2), t3(thread3, 3), t4(thread4, 4);
    t1.join(); t2.join(); t3.join(); t4.join();
}

Here is my compile command: g++ xxx.cpp --std=c++11 -O0 -lpthread -g, so there is no optimization (-O0).

I printed the virtual addresses of a, b, c, and d:

a addr 0x406200
b addr 0x406204
c addr 0x406258
d addr 0x406298
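
(For reference: which 64-byte cache line an address falls in is the address shifted right by 6, and which 128-byte adjacent-line pair, as mentioned in the comments below, is the address shifted right by 7. A minimal sketch of that check, assuming 64-byte lines as usual on x86-64; the helper name is illustrative:)

#include <cstdint>
#include <cstdio>

// Print which 64-byte cache line and which 128-byte line pair an
// address falls into (64-byte lines are the norm on x86-64).
void print_line_indices(const char* name, uintptr_t addr) {
    printf("%s: line %#lx, pair %#lx\n", name,
           (unsigned long)(addr >> 6), (unsigned long)(addr >> 7));
}

int main() {
    print_line_indices("a", 0x406200);  // line 0x10188, pair 0x80c4
    print_line_indices("b", 0x406204);  // line 0x10188, pair 0x80c4 (same line as a)
    print_line_indices("c", 0x406258);  // line 0x10189, pair 0x80c4 (own line, same 128-byte pair as a/b)
    print_line_indices("d", 0x406298);  // line 0x1018a, pair 0x80c5 (own line and own pair)
}

So only a and b share a 64-byte line; c has its own line but sits in the same 128-byte pair as a and b, while d is fully isolated.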

Here is the execution result:

 4 cost:2186474910
 3 cost:6114449628
 1 cost:7464439728
 2 cost:7469428696

As I understand it, there is no 'cache bouncing' or 'false sharing' between thread3 and any other thread, so why is it slower than thread4?

Addition: if I change int32_t a, b to int64_t a, b, the result changes to:

a addr 0x4061e0
b addr 0x4061e8
c addr 0x406238
d addr 0x406278
3 cost:2188341526
4 cost:2193782423
2 cost:6479324727
1 cost:6645607256

which is what I predicted.
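
(Running the same line arithmetic, using the print_line_indices sketch from above, on the new addresses:)

print_line_indices("a", 0x4061e0);  // line 0x10187, pair 0x80c3
print_line_indices("b", 0x4061e8);  // line 0x10187, pair 0x80c3 (still shares a's line)
print_line_indices("c", 0x406238);  // line 0x10188, pair 0x80c4
print_line_indices("d", 0x406278);  // line 0x10189, pair 0x80c4

Now only a and b share a 64-byte line, which matches threads 1 and 2 being slow and threads 3 and 4 being fast. (Note that c and d now fall in the same 128-byte pair, yet both stay fast in this run.)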

Asked by Sanity on 8/11/2021 at 7:17. Comments (8):
I calculated the first case: the addresses 0x406200, 0x406204, 0x406258, 0x406298 in decimal are 4219392, 4219396, 4219480, 4219544. 4219392 is a multiple of 64, so it… – Sanity
Why would you use -O0 and limit it to only 1 store per ~6 clock cycles, bottlenecked on store-forwarding latency of the loop counter? You're using volatile on the actual stores you care about. Are you intentionally benchmarking code that has some dependent loads to trigger possible memory-order mis-speculation? – Hereinafter
L2 spatial prefetch might be causing some interference for thread C; it's in the same 128-byte aligned pair of cache lines as A and B (unlike D). What specific CPU model do you have? Intel has a spatial prefetcher that tries to complete adjacent lines (so for current CPUs an appropriate value for std::hardware_destructive_interference_size would be 128, but only 64 for std::hardware_constructive_interference_size); IDK about AMD's prefetchers. – Hereinafter
This is independent and complete code for testing the cache false-sharing problem. I use -O0 because if I don't, the loop may be optimized out by g++. There is no other code except main(), which calls thread t1(thread1, 1); through thread t4(thread4, 4);, joins them, and prints a through d. – Sanity
a = i % 512; can't be optimized out because a is volatile. That's the whole point of using volatile here: every assignment to it is a visible side effect that the optimizer must respect. (With -O0, everything is treated sort of like volatile.) – Hereinafter
Yes, you're right about -O0. I changed to -O1 and disassembled the code; the loop is still there, and the addresses and costs change to: a addr 0x40429c, b addr 0x404298, c addr 0x404258, d addr 0x404200; 4 cost:528467443, 3 cost:532451691, 1 cost:652654952, 2 cost:654210170. Next I will check hardware_destructive_interference_size. – Sanity
Oh, the variables changed address, I think in reverse order. Put them in a struct and align the struct by 128 (or 4096) if you want to control for that. – Hereinafter
I am not exactly sure whether the standard specifies how variables are ordered in memory with regard to the order of their declarations (with the exception of non-static member variables with the same access rights, which does not apply here). In such a case, you cannot make any assumptions about the "distances" between your variables' storage in memory. The correct solution would be to enforce stricter alignment, such as with the alignas specifier. – Lamebrain
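
(To make the last two suggestions concrete, a minimal sketch of forcing each variable onto its own 128-byte-aligned block with alignas; the struct and variable names are illustrative. 128 covers the adjacent-line prefetcher discussed above, and on C++17 with a recent GCC one could use std::hardware_destructive_interference_size from <new> instead of the hard-coded 128:)

#include <cstdint>

// Members with the same access control are laid out in declaration
// order, and alignas(128) forces each onto its own 128-byte block,
// so no two of them can share a 64-byte line or a 128-byte line pair.
struct AlignedVars {
    alignas(128) volatile int64_t a;
    alignas(128) volatile int64_t b;
    alignas(128) volatile int64_t c;
    alignas(128) volatile int64_t d;
};

AlignedVars vars;  // would replace the four separate globals in the question

This also sidesteps the reordering of separate globals seen at -O1.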
