Efficiency of Multithreaded Loops

Greetings noble community,

I have the following loop:

for(i = 0; i < MAX; i++)
    A[i] = B[i] + C[i];

This will run in parallel on a shared-memory quad-core computer using threads. The two alternatives below are being considered for the code each thread executes, where tid is the thread's id: 0, 1, 2 or 3.

(for simplicity, assume MAX is a multiple of 4)

Option 1:

for(i = tid; i < MAX; i += 4)
    A[i] = B[i] + C[i];

Option 2:

for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i++)
    A[i] = B[i] + C[i];

My question is whether one is more efficient than the other, and why.
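
(For context, this is the kind of harness I have in mind, sketched here with POSIX threads; MAX, the array contents, and the worker body are just placeholders for whichever option is chosen.)

#include <pthread.h>

#define MAX 4096            /* assumed: any multiple of 4 works */
#define NTHREADS 4

static int A[MAX], B[MAX], C[MAX];

/* Each thread runs one of the two option bodies with its own tid. */
static void *worker(void *arg)
{
    long tid = (long)arg;
    for (long i = tid * (MAX / 4); i < (tid + 1) * (MAX / 4); i++)  /* Option 2 shown */
        A[i] = B[i] + C[i];
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (long tid = 0; tid < NTHREADS; tid++)
        pthread_create(&threads[tid], NULL, worker, (void *)tid);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    return 0;
}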

Brok answered 27/1, 2015 at 20:55 Comment(3)
The usual answer for this sort of question is "it depends; try it both ways and measure the results for yourself, otherwise you'll never know for sure"Eldridge
I don't have the resources to test it properly. My reasoning so far: in "Option 1" the index is incremented by the constant 4, so each thread jumps ahead 4 elements at a time while computing A[i]. In "Option 2" the index is incremented by 1, but each thread only works on the range between tid*(MAX/4) and (tid+1)*(MAX/4).Brok
That way, in "Option 2" each thread gets one larger contiguous block with no gaps between its elements, unlike "Option 1", so I would expect "Option 2" to be more efficient.Brok

The second one is better than the first. Simple answer: the second one minimizes false sharing.

A modern CPU doesn't load bytes into the cache one by one. It reads memory in batches called cache lines. When two threads try to modify different variables that sit on the same cache line, each core must reload the line after the other modifies it.

When would this happen?

Basically, elements that are nearby in memory end up in the same cache line. So neighboring elements of an array will be in the same cache line, since an array is just a contiguous chunk of memory. And foo1 and foo2 might be in the same cache line as well, since they are declared next to each other in the same class.

class Foo {
    private int foo1;
    private int foo2;
}
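
This is exactly what happens to the arrays in the question: assuming a 64-byte cache line and 4-byte ints, 16 consecutive elements of A share one line, so Option 1's stride-4 loop has all four threads writing into every line of A, while Option 2 gives each thread its own contiguous lines (except possibly at the boundaries between blocks). For fields like foo1 and foo2, the usual fix in C or C++ is to force each hot field onto its own cache line; a minimal C11 sketch, assuming a 64-byte line size (not part of the original answer):

#include <stdalign.h>

struct Foo {
    alignas(64) int foo1;   /* starts its own 64-byte cache line */
    alignas(64) int foo2;   /* starts another line, so writes to foo1 and foo2 no longer collide */
};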

How bad is false sharing?

I'll refer to Example 6 from the Gallery of Processor Cache Effects:

private static int[] s_counter = new int[1024];
private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}

On my quad-core machine, if I call UpdateCounter with parameters 0,1,2,3 from four different threads, it will take 4.3 seconds until all threads are done. On the other hand, if I call UpdateCounter with parameters 16,32,48,64 the operation will be done in 0.28 seconds!

How to detect false sharing?

Linux perf can be used to detect cache misses and therefore help you analyze such problems.

Referring to the analysis from CPU Cache Effects and Linux Perf, which uses perf to find L1 cache misses for almost the same code example as above:

Performance counter stats for './cache_line_test 0 1 2 3':
10,055,747 L1-dcache-load-misses     #    1.54% of all L1-dcache hits   [51.24%]
Performance counter stats for './cache_line_test 16 32 48 64':
  36,992 L1-dcache-load-misses     #    0.01% of all L1-dcache hits   [50.51%]

It shows that the number of L1 cache misses drops from 10,055,747 to 36,992 when there is no false sharing. And the performance overhead is not in the L1 misses themselves; it's in the chain of L2 and L3 cache loads, and finally memory loads, that false sharing causes.
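
(For reference, numbers like these come from a plain perf stat run over the test binary from the example, along the lines of: perf stat -e L1-dcache-load-misses ./cache_line_test 0 1 2 3)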

Are there good practices in industry?

LMAX Disruptor is a high-performance inter-thread messaging library, and it's the default messaging system for intra-worker communication in Apache Storm. The underlying data structure is a simple ring buffer, but to make it fast it uses a lot of tricks to reduce false sharing.

For example, it defines the superclass RingBufferPad to create padding around the fields of RingBuffer:

abstract class RingBufferPad
{
    protected long p1, p2, p3, p4, p5, p6, p7;
}

Also, when it allocates memory for the buffer, it creates padding both at the front and at the tail so that the buffer won't be affected by data in adjacent memory:

this.entries   = new Object[sequencer.getBufferSize() + 2 * BUFFER_PAD];

source

You probably want to learn more about all the magic tricks. Take a look at one of the author's posts: Dissecting the Disruptor: Why it's so fast

Moralist answered 27/1, 2015 at 23:16 Comment(0)

There are two different reasons why you should prefer option 2 over option 1. One of these is cache locality / cache contention, as explained in the answer above; I won't explain that here as there's already a good answer explaining it.

The other reason is vectorisation. Most high-end modern processors have vector units which are capable of running the same instruction simultaneously on multiple pieces of data (in particular, if the processor has multiple cores, it almost certainly has a vector unit, maybe even multiple vector units, on each core). For example, the processor has a scalar instruction that performs a single addition:

A = B + C;

and the corresponding instruction in the vector unit will do multiple additions at the same time:

A1 = B1 + C1;
A2 = B2 + C2;
A3 = B3 + C3;
A4 = B4 + C4;

(The exact number of additions will vary by processor model; on ints, common "vector widths" include 4 and 8 simultaneous additions, and some recent processors can do 16.)

Your for loop looks like an obvious candidate for using the vector unit; as long as none of A, B, and C are pointers into the same array but with different offsets (which is possible in C++ but not Java), the compiler would be allowed to optimise option 2 into

for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i+=4) {
    A[i+0] = B[i+0] + C[i+0];
    A[i+1] = B[i+1] + C[i+1];
    A[i+2] = B[i+2] + C[i+2];
    A[i+3] = B[i+3] + C[i+3];
}
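
To make that concrete, here is a hedged sketch (not from the original answer) of what those four additions look like when written out with 128-bit SSE2 intrinsics in C; in practice the compiler emits the equivalent instructions itself rather than requiring this by hand:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Adds B[i..i+3] and C[i..i+3] and stores the results into A[i..i+3],
 * assuming 32-bit ints; unaligned loads/stores are used so no special
 * alignment of the arrays is required. */
static void add4(int *A, const int *B, const int *C, int i)
{
    __m128i b = _mm_loadu_si128((const __m128i *)&B[i]);
    __m128i c = _mm_loadu_si128((const __m128i *)&C[i]);
    _mm_storeu_si128((__m128i *)&A[i], _mm_add_epi32(b, c));
}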

However, one limitation of the vector unit is related to memory accesses: vector units are only fast at accessing memory when they're accessing adjacent locations (such as adjacent elements in an array, or adjacent fields of a C struct). The option 2 code above is pretty much the best case for vectorisation of the code: the vector unit can access all the elements it needs from each array as a single block. If you tried to vectorise the option 1 code, the vector unit would take so long trying to find all the values it's working on in memory that the gains from vectorisation would be negated; it would be unlikely to run any faster than the non-vectorised code, because the memory access wouldn't be any faster, and the addition takes no time by comparison (because the processor can do the addition while it's waiting for the values to arrive from memory).

It isn't guaranteed that a compiler will be able to make use of the vector unit, but it would be much more likely to do so with option 2 than option 1. So you might find that option 2's advantage over option 1 is a factor of 4/8/16 more than you'd expect if you only took cache effects into account.
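
As a practical note (not part of the original answer), you can ask the compiler whether it actually vectorised the loop: GCC reports this with -fopt-info-vec and Clang with -Rpass=loop-vectorize, typically at -O2 or -O3.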

Chantal answered 4/11, 2021 at 18:17 Comment(0)
