Environment: x86-64, Linux (CentOS), 8 CPU cores.
To test false-sharing performance I wrote C++ code like this:
volatile int32_t a;
volatile int32_t b;   // adjacent to a, same cache line
int64_t p1[7];        // 56 bytes of padding
volatile int64_t c;
int64_t p2[7];        // 56 bytes of padding
volatile int64_t d;
void thread1(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        a = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 1 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}
void thread2(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        b = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 2 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}
void thread3(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        c = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 3 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}
void thread4(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        d = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 4 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}
Here is my compile command: g++ xxx.cpp --std=c++11 -O0 -lpthread -g
so there is no optimization (-O0).
I printed the virtual addresses of a, b, c, d:
a addr 0x406200
b addr 0x406204
c addr 0x406258
d addr 0x406298
Here is the execution result:
4 cost:2186474910
3 cost:6114449628
1 cost:7464439728
2 cost:7469428696
As I understand it, thread3 has no cache-line bouncing or false sharing with any other thread, so why is it slower than thread4?
Addition: if I change int32_t a, b to int64_t a, b, the result changes to:
a addr 0x4061e0
b addr 0x4061e8
c addr 0x406238
d addr 0x406278
3 cost:2188341526
4 cost:2193782423
2 cost:6479324727
1 cost:6645607256
which is what I predicted.
Comments:

"… -O0 … and limit it to only 1 store per ~6 clock cycles, bottlenecked on store-forwarding latency of the loop counter? You're using volatile on the actual stores you care about. Are you intentionally benchmarking code that has some dependent loads to trigger possible memory-order mis-speculation?" – Hereinafter

"… std::hardware_destructive_interference_size would be 128, but only 64 for std::hardware_constructive_interference_size); IDK about AMD's prefetchers." – Hereinafter

"a = i % 512; can't be optimized out because a is volatile. That's the whole point of using volatile here: every assignment to it is a visible side effect that the optimizer must respect. (With -O0, everything is treated sort of like volatile.)" – Hereinafter

"a addr 0x40429c
b addr 0x404298
c addr 0x404258
d addr 0x404200
4 cost:528467443
3 cost:532451691
1 cost:652654952
2 cost:654210170
then I will check hardware_destructive_interference_size" – Sanity

"… alignas specifier." – Lamebrain