Why don't I see more false sharing when different Threads write to the same variable?

I am trying to understand a simple example and wonder why perf c2c does not report more false sharing than it does.

In my example (a matrix multiplication), two instances could cause false sharing (other than const-reads):

  • Different threads in a threadpool decrement an atomic counter to know which data they should work on.
  • The threads write their results at different offsets (with a stride of 4 bytes) into the same array. Importantly, each array position is written exactly once, and no other thread needs that value.

In the remainder, I ignore the atomic counter because it's shared on purpose.
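
For context, the claiming scheme from the first bullet can be sketched roughly as follows. This is a minimal stand-in and not the thread pool from the question: the function run_tasks and the counter name remaining are hypothetical, and the "computation" is replaced by storing the index itself. It only illustrates that every position is claimed by exactly one thread and written once.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the thread pool: each worker claims task indices
// by decrementing a shared atomic counter and writes one result per index.
std::vector<float> run_tasks(std::size_t num_tasks, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<float> out(num_tasks, -1.0f);
    std::atomic<std::ptrdiff_t> remaining{static_cast<std::ptrdiff_t>(num_tasks)};

    auto worker = [&] {
        for (;;) {
            // fetch_sub returns the value held before the decrement.
            const std::ptrdiff_t before =
                remaining.fetch_sub(1, std::memory_order_relaxed);
            if (before <= 0) break;  // nothing left to claim
            const std::size_t i = static_cast<std::size_t>(before - 1);
            out[i] = static_cast<float>(i);  // exactly one write per position
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();  // join makes all writes visible here
    return out;
}
```

The counter is the only location that every thread modifies repeatedly, while neighbouring elements of out are written by whichever thread happened to claim them.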

I expect to see cache-line contention because of the following resources:

  1. The Linux kernel docs describe false sharing as:

There are two key factors for a harmful false sharing:

  • A global datum accessed (shared) by many CPUs
  • In the concurrent accesses to the data, there is at least one write operation: write/write or write/read cases.

  2. In this SO thread there's an explicit mention of false sharing when threads only write to the position.

Based on that statement, I would expect to see false sharing when different threads just write to individual positions of the same cache line. The relevant code for the joint write to the array is below (the full code is here):

void fma_f32_single(const float* __restrict__ aptr,
                    const float* __restrict__ bptr, size_t M, size_t N,
                    size_t K, float* __restrict__ cptr) {
    float c0{0};
    for (size_t i = 0; i < K; ++i) {
        c0 += (*(aptr + i)) * (*(bptr + N * i));
    }
    *cptr = c0;
}

struct pthreadpool_context {
    const float* __restrict__ a;
    const float* __restrict__ b;
    float* __restrict__ c;
    size_t M;
    size_t N;
    size_t K;
    std::vector<std::pair<size_t, size_t>> indices;
};

void work(void* ctx, size_t i) {
    const auto* context = static_cast<const pthreadpool_context*>(ctx);
    const auto [row, col] = context->indices[i];
    const float* aptr = context->a + row * context->K;
    const float* bptr = context->b + col;
    float* cptr = context->c + row * context->N + col;
    // Increasing col by one moves cptr by only four bytes within c.
    // Each thread writes its output through cptr.
    fma_f32_single(aptr, bptr, context->M, context->N, context->K, cptr);
}
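
To make the layout concern concrete, here is a small sketch that checks whether two elements of c fall into the same cache line. It assumes a 64-byte line, which is an assumption on my side and not something the code above guarantees (other CPUs may use a different size). The names kAssumedLineBytes, line_of and same_line are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size in bytes; adjust for the target CPU.
constexpr std::size_t kAssumedLineBytes = 64;

// Index of the (assumed) cache line that contains the given address.
inline std::uintptr_t line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kAssumedLineBytes;
}

// True if c[row * N + col_a] and c[row * N + col_b] share a cache line.
inline bool same_line(const float* c, std::size_t N, std::size_t row,
                      std::size_t col_a, std::size_t col_b) {
    assert(col_a < N && col_b < N);
    return line_of(c + row * N + col_a) == line_of(c + row * N + col_b);
}
```

With 4-byte floats and a 64-byte line, up to 16 consecutive columns of one row can land in a single line, so tasks with neighbouring col values that run on different threads would write to the same line.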

When profiling the code with perf c2c record -F 60000 ./a.out and reporting it with perf c2c report -c tid,iaddr, the only cache line shown as shared is the one containing the atomic counter, but not the ones holding the array c.

My questions are:

  • Should I expect to see false sharing on the output array c? Do I maybe need to record events other than the default ones?
  • The test system is an Intel Coffee Lake (no NUMA). Should I expect that there's also no sharing on other generations of Intel machines, and also on ARM machines?

Perf c2c output

In case that's relevant, here's the full perf output (the code fragment threadpool.cpp:25 denotes the atomic decrement of the counter):

=================================================
            Trace Event Information              
=================================================
  Total records                     :     414185
  Locked Load/Store Operations      :        126
  Load Operations                   :     164311
  Loads - uncacheable               :          0
  Loads - IO                        :          0
  Loads - Miss                      :          1
  Loads - no mapping                :          7
  Load Fill Buffer Hit              :        477
  Load L1D hit                      :      84675
  Load L2D hit                      :          7
  Load LLC hit                      :      79115
  Load Local HITM                   :         40
  Load Remote HITM                  :          0
  Load Remote HIT                   :          0
  Load Local DRAM                   :         29
  Load Remote DRAM                  :          0
  Load MESI State Exclusive         :          0
  Load MESI State Shared            :         29
  Load LLC Misses                   :         29
  Load access blocked by data       :          0
  Load access blocked by address    :          0
  Load HIT Local Peer               :          0
  Load HIT Remote Peer              :          0
  LLC Misses to Local DRAM          :      100.0%
  LLC Misses to Remote DRAM         :        0.0%
  LLC Misses to Remote cache (HIT)  :        0.0%
  LLC Misses to Remote cache (HITM) :        0.0%
  Store Operations                  :     249874
  Store - uncacheable               :          0
  Store - no mapping                :          0
  Store L1D Hit                     :     249829
  Store L1D Miss                    :         45
  Store No available memory level   :          0
  No Page Map Rejects               :       4617
  Unable to parse data source       :          0

=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :          1
  Load HITs on shared lines         :        185
  Fill Buffer Hits on shared lines  :         43
  L1D hits on shared lines          :         66
  L2D hits on shared lines          :          0
  LLC hits on shared lines          :         76
  Load hits on peer cache or nodes  :          0
  Locked Access on shared lines     :        104
  Blocked Access on shared lines    :          0
  Store HITs on shared lines        :        785
  Store L1D hits on shared lines    :        785
  Store No available memory level   :          0
  Total Merged records              :        825

=================================================
                 c2c details                     
=================================================
  Events                            : cpu/mem-loads,ldlat=30/P
                                    : cpu/mem-stores/P
  Cachelines sort on                : Total HITMs
  Cacheline data grouping           : offset,tid,iaddr

=================================================
           Shared Data Cache Line Table          
=================================================
#
#        ----------- Cacheline ----------      Tot  ------- Load Hitm -------    Total    Total    Total  --------- Stores --------  ----- Core Load Hit -----  - LLC Load Hit --  - RMT Load Hit --  --- Load Dram ----
# Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  records    Loads   Stores    L1Hit   L1Miss      N/A       FB       L1       L2    LclHit  LclHitm    RmtHit  RmtHitm       Lcl       Rmt
# .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  ........  .......  ........  .......  ........  ........
#
      0      0x7fff33e60700     0     151  100.00%       40       40        0      970      185      785      785        0        0       43       66        0        36       40         0        0         0         0

=================================================
      Shared Cache Line Distribution Pareto      
=================================================
#
#        ----- HITM -----  ------- Store Refs ------  --------- Data address ---------                                     ---------- cycles ----------    Total       cpu                                  Shared                         
#   Num  RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A              Offset  Node  PA cnt            Tid        Code address  rmt hitm  lcl hitm      load  records       cnt                          Symbol  Object        Source:Line  Node
# .....  .......  .......  .......  .......  .......  ..................  ....  ......  .............  ..................  ........  ........  ........  .......  ........  ..............................  .....  .................  ....
#
  ----------------------------------------------------------------------
      0        0       40      785        0        0      0x7fff33e60700
  ----------------------------------------------------------------------
           0.00%    2.50%   23.44%    0.00%    0.00%                0x34     0       1    84530:a.out      0x5e87bc063dfe         0       241       130      203         1  [.] ThreadPool::QueueTask(void  a.out  atomic_base.h:628   0
           0.00%    2.50%   24.84%    0.00%    0.00%                0x34     0       1    84533:a.out      0x5e87bc063a32         0       306       188      233         1  [.] ThreadMain(std::stop_token  a.out  atomic_base.h:628   0
           0.00%    0.00%   25.61%    0.00%    0.00%                0x34     0       1    84532:a.out      0x5e87bc063a32         0         0       167      225         1  [.] ThreadMain(std::stop_token  a.out  atomic_base.h:628   0
           0.00%    0.00%   26.11%    0.00%    0.00%                0x34     0       1    84534:a.out      0x5e87bc063a32         0         0       193      228         1  [.] ThreadMain(std::stop_token  a.out  atomic_base.h:628   0
           0.00%   42.50%    0.00%    0.00%    0.00%                0x38     0       1    84533:a.out      0x5e87bc0639f0         0       132       128       33         1  [.] ThreadMain(std::stop_token  a.out  threadpool.cpp:25   0
           0.00%   35.00%    0.00%    0.00%    0.00%                0x38     0       1    84534:a.out      0x5e87bc0639f0         0       133       121       28         1  [.] ThreadMain(std::stop_token  a.out  threadpool.cpp:25   0
           0.00%   17.50%    0.00%    0.00%    0.00%                0x38     0       1    84532:a.out      0x5e87bc0639f0         0       124       111       20         1  [.] ThreadMain(std::stop_token  a.out  threadpool.cpp:25   0
Drily answered 8/9, 2024 at 16:1 Comment(1)
Can you make this into an MRE so that we can see clearly how ctx is populated? Leanto

Perf c2c didn't show more instances of cache-line sharing on the Coffee Lake target, but it did show cache-line sharing on other targets (an Alder Lake laptop and a Graviton 3 instance).

The key to obtaining more detailed stats is the sampling frequency. The max sampling frequency on the Alder Lake laptop is 120 Hz. (I'm not sure how sampling on the Graviton 3 works.)

The frequency on an Intel machine can be set with the -F X option, and the system-wide max frequency can be read from /proc/sys/kernel/perf_event_max_sample_rate.
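
As a rough sketch, the limit can also be read programmatically and compared with the value one intends to pass to -F. The helper names parse_sample_rate, read_max_sample_rate and exceeds_limit are hypothetical; only the /proc path comes from the text above, and the code assumes the file holds a single positive integer.

```cpp
#include <cassert>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

// Parse one positive integer (samples per second) from the file contents.
std::optional<long> parse_sample_rate(const std::string& text) {
    std::istringstream in(text);
    long rate = 0;
    if (in >> rate && rate > 0) return rate;
    return std::nullopt;
}

// Read the system-wide limit; yields nullopt if the file is missing or malformed.
std::optional<long> read_max_sample_rate(
    const std::string& path = "/proc/sys/kernel/perf_event_max_sample_rate") {
    std::ifstream file(path);
    if (!file) return std::nullopt;
    std::string line;
    std::getline(file, line);
    return parse_sample_rate(line);
}

// True if the frequency requested via -F is above the system-wide limit.
bool exceeds_limit(long requested, long max_rate) {
    assert(requested > 0 && max_rate > 0);
    return requested > max_rate;
}
```

For example, calling exceeds_limit(60000, *read_max_sample_rate()) after checking that the optional holds a value tells whether the -F 60000 used in the question is above the limit on a given machine.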

Drily answered 10/9, 2024 at 5:50 Comment(0)
