I am trying to understand a simple example and wonder why perf c2c reports less false sharing than I would expect.
In my example (a matrix multiplication), there are two places where false sharing could occur (apart from const reads):
- Different threads in a threadpool decrement an atomic counter to know which data they should work on (see the sketch after this paragraph).
- The threads write their results at different offsets (with a 4-byte stride) into the same array. Importantly, each array position is written exactly once, and the value is not needed by any other thread.
In the remainder, I ignore the atomic counter because it is shared on purpose.
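For context (the counter is what later shows up as threadpool.cpp:25 in the perf output), the work distribution looks roughly like this. This is a simplified sketch with illustrative names, not the exact code from the linked threadpool:

#include <atomic>
#include <cstddef>

void work(void* ctx, size_t i);  // the per-task function shown further below

// Each worker claims the next task by decrementing a shared counter.
// The fetch_sub is the intentional write that every thread performs
// on the same cache line.
std::atomic<ptrdiff_t> tasks_remaining;  // initialized to the task count

void worker_loop(void* ctx) {
  for (;;) {
    // fetch_sub returns the previous value, so the claimed index is one less.
    ptrdiff_t i = tasks_remaining.fetch_sub(1) - 1;
    if (i < 0) break;  // all tasks claimed
    work(ctx, static_cast<size_t>(i));
  }
}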
I expect to see cache-line contention based on the following resources:
- The Linux kernel docs describe false sharing as:
  "There are two key factors for a harmful false sharing:
  - A global datum accessed (shared) by many CPUs
  - In the concurrent accesses to the data, there is at least one write operation: write/write or write/read cases."
- In this SO thread, false sharing is explicitly mentioned for the case where threads only write to their positions (a toy version of that pattern is sketched below).
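To make that expectation concrete, the textbook write/write pattern from the kernel docs would look something like the following. This is a toy example with hypothetical names, not taken from my code:

#include <thread>

struct Shared {
  int a;  // written only by t1
  int b;  // written only by t2; sits on the same 64-byte line as `a`
};

Shared s;

void toy_false_sharing() {
  // Two threads repeatedly write to distinct fields of the same cache
  // line: the write/write case from the kernel docs.
  std::thread t1([] { for (int i = 0; i < 1000000; ++i) s.a = i; });
  std::thread t2([] { for (int i = 0; i < 1000000; ++i) s.b = i; });
  t1.join();
  t2.join();
}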
Based on those statements, I would expect to see false sharing when different threads each write to individual positions of the same cache line. The relevant code for the joint write to the array is below (the full code is here):
void fma_f32_single(const float* __restrict__ aptr,
                    const float* __restrict__ bptr, size_t M, size_t N,
                    size_t K, float* __restrict__ cptr) {
  float c0{0};
  // Dot product of one row of a with one column of b.
  for (size_t i = 0; i < K; ++i) {
    c0 += (*(aptr + i)) * (*(bptr + N * i));
  }
  // The single write to the output array; this is where I expect false sharing.
  *cptr = c0;
}
struct pthreadpool_context {
  const float* __restrict__ a;
  const float* __restrict__ b;
  float* __restrict__ c;
  size_t M;
  size_t N;
  size_t K;
  std::vector<std::pair<size_t, size_t>> indices;
};
void work(void* ctx, size_t i) {
  const pthreadpool_context* context = (const pthreadpool_context*)ctx;
  const auto [row, col] = context->indices[i];
  const float* aptr = context->a + row * context->K;
  const float* bptr = context->b + col;
  float* cptr = context->c + row * context->N + col;
  // Increasing col by one moves cptr by only four bytes within c,
  // so neighbouring tasks write into the same cache line.
  // Threads write their output to cptr.
  fma_f32_single(aptr, bptr, context->M, context->N, context->K, cptr);
}
When profiling the code with perf c2c record -F 60000 ./a.out and reporting it with perf c2c report -c tid,iaddr, the only cache line shown as shared is the one containing the atomic counter, not any of the lines holding the array c.
My questions are:
- Should I expect to see false sharing on the output array c? Do I maybe need to record events other than the default ones?
- The test system is an Intel Coffee Lake (no NUMA). Should I expect that there is also no sharing on other generations of Intel machines, and on ARM machines?
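One detail that may matter for the first question: each element of c is written exactly once, so there are very few store samples per line for perf to catch. As a sanity check I could rewrite each output element in a loop; this hypothetical variant of work() (not my real kernel) should produce HITMs if perf c2c can detect them here at all:

void work_rewrite(void* ctx, size_t i) {
  const pthreadpool_context* context = (const pthreadpool_context*)ctx;
  const auto [row, col] = context->indices[i];
  float* cptr = context->c + row * context->N + col;
  // Hammer the same element so that neighbouring threads keep
  // invalidating each other's copy of the cache line.
  for (int rep = 0; rep < 100000; ++rep) {
    *cptr = static_cast<float>(rep);
  }
}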
Perf c2c output
In case it's relevant, here's the full perf output (the code location threadpool.cpp:25 denotes the atomic decrement of the counter):
=================================================
Trace Event Information
=================================================
Total records : 414185
Locked Load/Store Operations : 126
Load Operations : 164311
Loads - uncacheable : 0
Loads - IO : 0
Loads - Miss : 1
Loads - no mapping : 7
Load Fill Buffer Hit : 477
Load L1D hit : 84675
Load L2D hit : 7
Load LLC hit : 79115
Load Local HITM : 40
Load Remote HITM : 0
Load Remote HIT : 0
Load Local DRAM : 29
Load Remote DRAM : 0
Load MESI State Exclusive : 0
Load MESI State Shared : 29
Load LLC Misses : 29
Load access blocked by data : 0
Load access blocked by address : 0
Load HIT Local Peer : 0
Load HIT Remote Peer : 0
LLC Misses to Local DRAM : 100.0%
LLC Misses to Remote DRAM : 0.0%
LLC Misses to Remote cache (HIT) : 0.0%
LLC Misses to Remote cache (HITM) : 0.0%
Store Operations : 249874
Store - uncacheable : 0
Store - no mapping : 0
Store L1D Hit : 249829
Store L1D Miss : 45
Store No available memory level : 0
No Page Map Rejects : 4617
Unable to parse data source : 0
=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 1
Load HITs on shared lines : 185
Fill Buffer Hits on shared lines : 43
L1D hits on shared lines : 66
L2D hits on shared lines : 0
LLC hits on shared lines : 76
Load hits on peer cache or nodes : 0
Locked Access on shared lines : 104
Blocked Access on shared lines : 0
Store HITs on shared lines : 785
Store L1D hits on shared lines : 785
Store No available memory level : 0
Total Merged records : 825
=================================================
c2c details
=================================================
Events : cpu/mem-loads,ldlat=30/P
: cpu/mem-stores/P
Cachelines sort on : Total HITMs
Cacheline data grouping : offset,tid,iaddr
=================================================
Shared Data Cache Line Table
=================================================
#
# ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total --------- Stores -------- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- Load Dram ----
# Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss N/A FB L1 L2 LclHit LclHitm RmtHit RmtHitm Lcl Rmt
# ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... ........ ........
#
0 0x7fff33e60700 0 151 100.00% 40 40 0 970 185 785 785 0 0 43 66 0 36 40 0 0 0 0
=================================================
Shared Cache Line Distribution Pareto
=================================================
#
# ----- HITM ----- ------- Store Refs ------ --------- Data address --------- ---------- cycles ---------- Total cpu Shared
# Num RmtHitm LclHitm L1 Hit L1 Miss N/A Offset Node PA cnt Tid Code address rmt hitm lcl hitm load records cnt Symbol Object Source:Line Node
# ..... ....... ....... ....... ....... ....... .................. .... ...... ............. .................. ........ ........ ........ ....... ........ .............................. ..... ................. ....
#
----------------------------------------------------------------------
0 0 40 785 0 0 0x7fff33e60700
----------------------------------------------------------------------
0.00% 2.50% 23.44% 0.00% 0.00% 0x34 0 1 84530:a.out 0x5e87bc063dfe 0 241 130 203 1 [.] ThreadPool::QueueTask(void a.out atomic_base.h:628 0
0.00% 2.50% 24.84% 0.00% 0.00% 0x34 0 1 84533:a.out 0x5e87bc063a32 0 306 188 233 1 [.] ThreadMain(std::stop_token a.out atomic_base.h:628 0
0.00% 0.00% 25.61% 0.00% 0.00% 0x34 0 1 84532:a.out 0x5e87bc063a32 0 0 167 225 1 [.] ThreadMain(std::stop_token a.out atomic_base.h:628 0
0.00% 0.00% 26.11% 0.00% 0.00% 0x34 0 1 84534:a.out 0x5e87bc063a32 0 0 193 228 1 [.] ThreadMain(std::stop_token a.out atomic_base.h:628 0
0.00% 42.50% 0.00% 0.00% 0.00% 0x38 0 1 84533:a.out 0x5e87bc0639f0 0 132 128 33 1 [.] ThreadMain(std::stop_token a.out threadpool.cpp:25 0
0.00% 35.00% 0.00% 0.00% 0.00% 0x38 0 1 84534:a.out 0x5e87bc0639f0 0 133 121 28 1 [.] ThreadMain(std::stop_token a.out threadpool.cpp:25 0
0.00% 17.50% 0.00% 0.00% 0.00% 0x38 0 1 84532:a.out 0x5e87bc0639f0 0 124 111 20 1 [.] ThreadMain(std::stop_token a.out threadpool.cpp:25 0