Why isn't N independent calculations N times faster on N threads?

I have an N-core processor (4 in my case). Why aren't N totally independent function calls on N threads roughly N times faster (of course there is some overhead for creating threads, but read further)?

Look at the following code:

namespace ch = std::chrono;
namespace mp = boost::multiprecision;
constexpr static unsigned long long int num = 3555;

// mp_factorial uses boost/multiprecision/cpp_int, so I get legit results

    ch::steady_clock::time_point s1 = ch::steady_clock::now();
    auto fu1 = std::async(std::launch::async, mp_factorial, num);
    auto fu2 = std::async(std::launch::async, mp_factorial, num);
    auto fu3 = std::async(std::launch::async, mp_factorial, num);
    auto fu4 = std::async(std::launch::async, mp_factorial, num);
    fu1.get(); fu2.get(); fu3.get(); fu4.get();
    ch::steady_clock::time_point e1 = ch::steady_clock::now();

    ch::steady_clock::time_point s2 = ch::steady_clock::now();
    mp_factorial(num);
    mp_factorial(num);
    mp_factorial(num);
    mp_factorial(num);
    ch::steady_clock::time_point e2 = ch::steady_clock::now();

    auto t1 = ch::duration_cast<ch::microseconds>(e1 - s1).count();
    auto t2 = ch::duration_cast<ch::microseconds>(e2 - s2).count();

    cout << t1 << " " << t2 << endl;
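
For reference, mp_factorial itself is nothing fancy - roughly like this (a simplified sketch; the actual function I use may differ slightly):

    #include <boost/multiprecision/cpp_int.hpp>
    namespace mp = boost::multiprecision;

    // Simplified: multiply 1 * 2 * ... * n into an arbitrary-precision integer.
    mp::cpp_int mp_factorial(unsigned long long int n)
    {
        mp::cpp_int result = 1;
        for (unsigned long long int i = 2; i <= n; ++i)
            result *= i;
        return result;
    }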

I get results like:

11756 20317

That's only roughly 2 times faster. I've also tried this with huge numbers, like num = 355555, and got a very similar ratio:

177462588 346575062

Why is this the case? I'm perfectly aware of Amdahl's law, and that a multi-core processor isn't always number_of_cores times faster, but when I have independent operations I'd expect better results. At least something near number_of_cores.


Update:

As you can see, all threads are working as expected, so this is not the issue:

[Screenshot: workload of threads]

Battlement answered 15/7, 2015 at 22:46 Comment(10)
Do you really have four cores? Or just two cores with four hyperthreads? (As e.g. the Intel i3).Offcolor
Also: Did you check that the 4 function calls in the sequential test aren't inlined?Cognac
@BaummitAugen Yes. I've tested it on an i7-4700MQ. 4 cores, 8 hyperthreads.Battlement
It depends on the efficiency of the operating system's scheduler. Other programs and background tasks are also assigned to those cores and take time quanta away from your threads, and users have very little control over that; it's difficult to force the OS to dedicate a core entirely to one thread of a program (setting affinities only locks a thread to a core, it doesn't block other processes from using it).Seminary
@DanielJour I didn't. How can I explicitly forbid inlining? Haven't really thought about that. These numbers get huge quickly, like 5 MB of memory. But still, copying 5 MB in DDR3 RAM isn't really comparable to calculating it IMO.Battlement
@DanielJour I mean this whole thing takes like 2 minutes. That's not comparable to copying roughly 5 MB in memory.Battlement
Even if you have 4 cores, you only have 1 RAM. If the algorithm isn't running from L1 cache you may be getting contention that limits the total throughput.Udelle
Do you see it using 4 different cores? This question was saying he only saw the threads running on one core...Duro
Use a function pointer and assign it in a volatile environment: bool volatile yes = true; if (yes) { fnptr = mp_factorial; } else { fnptr = dummyfn; } fnptr(num); // 3 more times. Though I'd be surprised if that were the only cause.Cognac
Is mp_factorial doing dynamic allocation? If so, the threads are not really acting independently.Typecase

The problem here is that you certainly have some large lumpy numbers, which do not fit in the L1 and L2 caches of your processor, meaning that the processor sits and twiddles its little ALU fingers whilst the memory controller is jumping all over the place trying to read a little bit of memory for each processor.

When you run in ONE thread, that one thread will, at least largely, only work on three different memory regions (a = b * c, reading from b and c, writing to a).

When you do 4 threads, you have four different a = b * c; each with three different streams of data, leading to more thrashing of the caches, the memory controller and the "open pages" [pages here are a DRAM term, nothing to do with MMU pages, but you may also find that TLB misses are a factor as well].

So you get better performance from running more threads, but not 4x: because of the large amount of data being consumed and produced by each thread, the memory interface is the bottleneck. Other than getting a machine with a more efficient memory interface [and that may not be so easy], there's not much you can do about this - just accept that for this particular case, memory is more of a limiting factor than the calculation.

The ideal problems to solve with multithreading are those that require a lot of calculation but don't use very much memory. I have a simple prime number calculator and one that calculates "weird numbers"; both give almost exactly Nx performance improvement when run on N cores [but were I to start using them for numbers that are many times larger than 64 bits, they would stop giving the same benefit].
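
To illustrate, a compute-bound workload along these lines (not my actual code, just a sketch: counting primes by trial division, with each task working on its own independent set of numbers) scales close to Nx on N cores:

    #include <cstddef>
    #include <future>
    #include <iostream>
    #include <vector>

    // Trial-division primality test: lots of arithmetic, almost no memory traffic.
    static bool is_prime(unsigned long long n)
    {
        if (n < 2) return false;
        for (unsigned long long d = 2; d * d <= n; ++d)
            if (n % d == 0) return false;
        return true;
    }

    // Each task counts primes among n = start, start + stride, start + 2*stride, ...
    // Striding keeps the amount of work per task roughly balanced.
    static std::size_t count_primes(unsigned long long start,
                                    unsigned long long stride,
                                    unsigned long long limit)
    {
        std::size_t count = 0;
        for (unsigned long long n = start; n < limit; n += stride)
            if (is_prime(n)) ++count;
        return count;
    }

    int main()
    {
        const unsigned long long limit = 2000000;
        const unsigned num_tasks = 4;            // one task per core

        std::vector<std::future<std::size_t>> futures;
        for (unsigned i = 0; i < num_tasks; ++i)
            futures.push_back(std::async(std::launch::async,
                                         count_primes, i, num_tasks, limit));

        std::size_t total = 0;
        for (auto& f : futures) total += f.get();
        std::cout << total << " primes below " << limit << "\n";
    }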

Edit: There's also the possibility of:

  • some function(s) that your code calls a lot are locking/blocking the other threads [possibly in a busy-wait fashion, if the implementation expects short wait-times, as calling the OS to wait a few dozen clock-cycles is pointless] - things like new and malloc and their freeing counterparts are plausible candidates.
  • True or false sharing - data is shared between CPU cores, causing cache content to be sent back and forth between processors. Particularly small shared arrays that are accessed [and updated] from each thread can make this go wrong, even if the updates are done locklessly and with atomic operations.

The term "false" sharing is used when you have something like this:

 // Some global array: one entry per thread.
 int array[MAX_THREADS];
 ...
 // Some function that each thread runs, updating its own entry:
 int my_id = thread_id();   // pseudocode: index of the current thread
 array[my_id]++;

Although each thread has its own array entry, the same cache line is bounced from one CPU to the other. I once had an SMP (before multi-core) dhrystone benchmark that ran at 0.7x the performance of one processor when running on 2 processors - because ONE of the commonly accessed data items was stored as an int array[MAX_THREADS]. This is of course a rather extreme example...
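
The usual cure is to give each thread's entry its own cache line, for example by aligning/padding it (a sketch, assuming 64-byte cache lines; MAX_THREADS is just an example value):

    #include <cstddef>

    constexpr std::size_t CACHE_LINE = 64;     // typical x86 cache-line size
    constexpr int MAX_THREADS = 8;             // example value

    // Each entry is aligned (and therefore padded) to a full cache line,
    // so two threads never write to the same line.
    struct alignas(CACHE_LINE) PaddedCounter {
        int value = 0;
    };

    PaddedCounter counters[MAX_THREADS];       // instead of int array[MAX_THREADS]

    // In each thread:
    //     counters[my_id].value++;            // no false sharing with neighbours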

Keturahkeung answered 15/7, 2015 at 23:11 Comment(1)
Yup, that's it. I am developing a thread pool class and ran into this issue while doing that. I tried adding less memory-intensive tasks to the pool and the scaling improved a lot (near N times). Thanks.Battlement

The answer sort of depends on whether you are using user threads or kernel threads. If the threads are implemented in user space, the kernel is not aware of them, and therefore they cannot execute in a truly parallel fashion across multiple physical CPU cores.

If the threads are implemented in kernel space, then the kernel is aware of them and can potentially schedule them in parallel across multiple physical CPU cores.

There is also overhead for thread creation, destruction and context switching. Every time a thread is context-switched, the thread library (or the kernel) has to save and restore register values, etc.

Mishap answered 15/7, 2015 at 23:12 Comment(3)
I doubt very much the first paragraph applies, given the graph in the question. For a benchmark that takes more than a few milliseconds on a modern desktop computer OS / processor combination, I'd say the thread creation and destruction is also insignificant.Keturahkeung
@MatsPetersson I'm not familiar with any system that has 5 threads on a CPU so I was wondering where he got the graph in the first place.Mishap
@Mishap The main function runs in a thread; that's thread 1. The 4 std::asyncs create 4 other threads, and those are the ones doing the work (at least in the parallel part). BTW I got the graph from my IDE (Xcode).Battlement
