What is "false sharing"? How to reproduce / avoid it?
Today I had a disagreement with my professor in the Parallel Programming class about what "false sharing" is. What my professor said made little sense, so I pointed it out immediately. She thought "false sharing" would cause a wrong result in the program.

I said that "false sharing" happens when different memory addresses are assigned to the same cache line: writing data to one of them causes the other to be kicked out of the cache. If processors keep alternating writes to the two falsely shared addresses, neither can stay in the cache, so all operations end up accessing DRAM.

That's my understanding so far. In fact, I'm not entirely sure about what I said either... If I have a misunderstanding, please point it out.

So here are some questions. Assume the cache has 64-byte lines and is 4-way set-associative.

  1. Is it possible for two addresses separated by more than 64 bytes to be "falsely shared"?
  2. Is it possible for a single-threaded program to run into a "false sharing" issue?
  3. What's the best code example to reproduce "false sharing"?
  4. In general, what should programmers keep in mind to avoid "false sharing"?
Contraction answered 31/3, 2014 at 15:49 Comment(1)
Here is a video about false sharing; hope that helps. I can't add a comment without 50 reputation, it's really awkward.Butler

I'll share my point of view on your questions.

  1. Two addresses separated by more bytes than the cache line's size won't reside on the same cache line. Thus, if one core has the first address in its cache and another core requests the second address, the first won't be removed from the cache because of that request, so no false-sharing miss will occur.

  2. I can't imagine how false sharing could occur when there's no concurrency at all, since there is no one but the single thread to compete for the cache line.

  3. Taken from here, using OpenMP, a simple example to reproduce false sharing would be:

    #include <omp.h>

    /* NUM_THREADS, N, x[] and y[] are assumed to be defined elsewhere */
    double sum = 0.0, sum_local[NUM_THREADS];

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int me = omp_get_thread_num();
        sum_local[me] = 0.0;

        /* adjacent elements of sum_local[] sit on the same cache line,
           so the threads' concurrent updates falsely share it */
        #pragma omp for
        for (int i = 0; i < N; i++)
            sum_local[me] += x[i] * y[i];

        #pragma omp atomic
        sum += sum_local[me];
    }
    
  4. Some general notes that I can think of to avoid false sharing:

    a. Use private data as much as possible.

    b. Sometimes you can use padding to align data, making sure that no other variables reside in the same cache line as the shared data.

Any corrections or additions are welcome.

Featured answered 19/11, 2014 at 16:29 Comment(5)
With optimization enabled, any decent compiler would keep sum_local[me] in a register, unless it failed to prove that x[i] and y[i] couldn't alias it. (And some compilers would emit code to check for overlap at runtime, and use an auto-vectorized loop in that case. Otherwise potential aliasing could make the loop suck even without false sharing; just having an extra 5 cycles of store-forwarding latency without false sharing in a loop-carried dependency chain is very bad). But yeah, anything slightly more complex could easily produce false sharing inside the loop, not just at the end.Ridley
@PeterCordes What if NUM_THREADS was not a constant? I don't think the compiler could assign registers in that casePriapism
@tuket: That doesn't matter. All it has to prove is that foo[bar] is the same address every time through the loop, and that writing it won't affect what we read from x[i] or y[i]. (e.g. that sum_local[] doesn't overlap with x[] or y[]). Since me is loop-invariant and sum_local is an array, the compiler can keep a sum in a register inside the loop, and store once at the end using the runtime variable value of me. Each thread has its own me, and knows the base address of the C99 VLA double sum_local[NUM_THREADS];.Ridley
@tuket: Either way, ideally you could have each thread just use its own private scalar variable, but the point of this example is to demonstrate false sharing so it's intentionally written "badly". I still think you'd have to compile with optimization disabled, though, to stop a compiler from optimizing it into a register. Or volatile double sum_local[NUM_THREADS];Ridley
Your answer to 1) Is misleading; a (generally minimal) degree of false sharing still occurs at cache-line size alignments because, on a lot of architectures, one of the hardware prefetchers will pull in multiple cache lines. E.g. Intel's Streamer and Spatial prefetchers pull two cache lines, and they mention 128 byte alignment of memory for this reason.Excurvate

Some languages (or dialects) offer solutions to address the problem, so you do not have to pad data manually.

For instance, C and C++ have an alignas specifier (since C11/C++11) that can place data further apart automatically, so there is no need for manual padding.

https://mcmap.net/q/742048/-practical-use-cases-for-alignof-and-alignas-c-keywords illustrates usage of the alignas specifier for events that can be shared by several threads of execution.

Rust's crossbeam_utils offers a CachePadded type.

In Oracle's Java 8 you can annotate data with @sun.misc.Contended. However, due to improved JVM contended-lock performance, in version 9 the Contended annotation was made JVM-internal (note: if you really want, you can still use JVM internals with a few tricks).

Lakeishalakeland answered 17/1 at 18:12 Comment(2)
alignas() has been available in C since C11, via stdalign.h. (Or via _Alignas()). C23 made alignas a true keyword without having to include a header. I'd just say C and C++ have alignas; compiler warnings and google can point people to stdalign.h if they need it. en.cppreference.com/w/c/language/_AlignasRidley
Agree. If you like, feel free to edit main textLakeishalakeland

© 2022 - 2024 — McMap. All rights reserved.