I have an N-core processor (4 in my case). Why aren't N totally independent function calls on N threads roughly N times faster? (Of course there is some overhead for creating threads, but read on.)
Look at the following code:
namespace ch = std::chrono;
namespace mp = boost::multiprecision;
constexpr static unsigned long long int num = 3555;
// mp_factorial uses boost/multiprecision/cpp_int, so I get legit results
ch::steady_clock::time_point s1 = ch::steady_clock::now();
auto fu1 = std::async(std::launch::async, mp_factorial, num);
auto fu2 = std::async(std::launch::async, mp_factorial, num);
auto fu3 = std::async(std::launch::async, mp_factorial, num);
auto fu4 = std::async(std::launch::async, mp_factorial, num);
fu1.get(); fu2.get(); fu3.get(); fu4.get();
ch::steady_clock::time_point e1 = ch::steady_clock::now();
ch::steady_clock::time_point s2 = ch::steady_clock::now();
mp_factorial(num);
mp_factorial(num);
mp_factorial(num);
mp_factorial(num);
ch::steady_clock::time_point e2 = ch::steady_clock::now();
auto t1 = ch::duration_cast<ch::microseconds>(e1 - s1).count();
auto t2 = ch::duration_cast<ch::microseconds>(e2 - s2).count();
cout << t1 << " " << t2 << endl;
I get results like:
11756 20317
That's roughly 2 times faster. I've also tried this with huge numbers, like num = 355555. I got really similar results:
177462588 346575062
Why is this the case? I'm perfectly aware of Amdahl's law, and that a multicore processor isn't always number_of_cores times faster, but when I have independent operations, I'd expect better results, at least something near number_of_cores.
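To put a number on that expectation, here is a minimal sketch of Amdahl's formula and its inverse. The function names are made up for illustration and are not part of my program.

```cpp
// Amdahl's law: expected speedup on n cores when a fraction p
// (0 <= p <= 1) of the work can run in parallel.
double amdahl_speedup(double p, unsigned n) {
    return 1.0 / ((1.0 - p) + p / n);
}

// Inverse of the formula above: the parallel fraction p implied by
// a measured speedup s on n cores (requires n > 1 and s > 0).
double implied_parallel_fraction(double s, unsigned n) {
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n);
}
```

Plugging in the first measurement (20317 / 11756, about 1.73 on 4 cores) gives an implied parallel fraction of roughly 0.56, and the second measurement gives roughly 0.65. For calls that share no data I would have expected a value close to 1.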
Update:
As you can see, all threads are working as expected, so this is not the issue:
Is mp_factorial doing dynamic allocation? If so, the threads are not really acting independently. – Typecase