Question
As Fork-Join seems to be the current hype and recommended in many answers, I thought: why not do some research on how fast it actually is?
To measure this, I wrote a small program (see code below) that does some adding of numbers and forks it out with various parameters, including the number of threads, fork depth and fork spread, then measured the execution time, particularly the time spent on actual calculation vs. the time spent on forking.
Abstract Answer
While the framework itself is well implemented, Fork-Join is an extremely inefficient way to parallelize a task, as the cost of each fork is very high. A naive problem-optimized implementation can easily achieve 99% thread execution time (which beats everything measured with Fork-Join), and therefore such an implementation is always faster than a Fork-Join implementation. In addition, if the actual task per fork is minor, a Fork-Join implementation can be much slower than even a single-threaded linear implementation.
So Fork-Join is more a question of whether or not it helps the architecture of your code, as it doesn't have any performance benefits over other implementations. Therefore Fork-Join should only be used if:
Performance isn't critical and tasks frequently need to wait for the result of other tasks to continue. So basically if the Fork-Join structure greatly simplifies the task over a naive implementation.
The actual task greatly outweighs the cost of forking, so the loss becomes negligible. In my test a loop that added 2 values had to run at least 10000 iterations per fork to get reasonable performance.
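To illustrate what is meant by a "naive problem-optimized implementation": splitting the total loop count evenly over a fixed set of plain threads synchronizes once per thread instead of once per fork, so almost all time goes into the actual work. This is only a sketch; the class name and the stand-in per-iteration work are made up, and the thread/loop counts are assumed to mirror the test setup below.

```java
import java.util.concurrent.atomic.AtomicLong;

public class NaiveParallelLoop {
    static final int NUM_THREADS = 4;           // assumed, matching the test setup below
    static final long TOTAL_LOOPS = 4_000_000L;

    public static void main(String[] args) throws InterruptedException {
        final AtomicLong total = new AtomicLong(0);
        final long perThread = TOTAL_LOOPS / NUM_THREADS;
        final Thread[] threads = new Thread[NUM_THREADS];
        for (int t = 0; t < NUM_THREADS; ++t) {
            threads[t] = new Thread(() -> {
                long local = 0;
                for (long n = 0; n < perThread; ++n) {
                    local += n; // stand-in for the actual per-iteration work
                }
                total.addAndGet(local); // a single synchronization per thread
            });
            threads[t].start();
        }
        for (final Thread thread : threads) {
            thread.join();
        }
        System.out.println("total=" + total.get()); // prints total=1999998000000
    }
}
```

Since each thread accumulates into a local variable and only touches shared state once at the end, nearly 100% of the wall time per thread is spent in the loop itself.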
Edit: See here for a more in-depth analysis that I was pointed to.
Test Setup
In my program I had a RecursiveTask calculate a Fibonacci series for a given N, which reduces the actual calculation to 3 assignments and 1 addition. For any given CPU this should be a minor task.
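Shown standalone, the "3 assignments and 1 addition" per iteration look like this (the iteration count stands in for the CALCULATION_N parameter in the code below; the small value here is only to make the result easy to check):

```java
public class FibLoop {
    public static void main(String[] args) {
        final int N = 10; // stands in for CALCULATION_N; small here for a checkable result
        int a = 0, b = 1, h;
        for (int n = 0; n < N; ++n) {
            h = b;     // assignment 1
            b = a + b; // the single addition, assignment 2
            a = h;     // assignment 3
        }
        System.out.println(b); // prints 89
    }
}
```

Note that `int` overflows past Fib(46), which is irrelevant here since only the timing matters, not the numeric result.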
Over the test I varied the number of threads, the number of forks per task and the length of the Fibonacci loop. In addition I ran some tests with the async parameter, but setting it to false only showed a minor reduction in calculation time, so it was skipped. The spread parameter (forks per fork) has likewise been mostly skipped, as there was no significant difference in the result.
In general the calculation time is very stable; the actual percentage of time spent on the task usually varies by less than 1%. Therefore each test set has been run about 5 times (or more if the numbers were unstable) on an otherwise idle system with 4 cores (+4 hyper-threads), and the median execution time has been selected.
The proper execution has been verified through various test variables; in particular, the number of actual threads used has been verified to never differ from the initially given parallelism parameter.
Detailed Test Results
Where:
Time total
is the total time the entire calculation took from the perspective of the main thread.
Time task
is the time spent on actually calculating the Fibonacci series in all forks combined.
Time task percentage
is the relative gain thanks to threading (time task / time total).
spread->depth
is the (set) spread (forks per fork) and the (calculated) forking depth.
threads
is the actual number of threads used.
task-time/thread
is the overall time each thread actually spent on calculating the Fibonacci series.
Spread->depth test:
Time total: 8766.670 ms, time task: 1717.418 ms ( 19.59%), spread->depth: 2->26, thread#: 1, task-time/thread: 19.59%
Time total: 7872.244 ms, time task: 1421.478 ms ( 18.06%), spread->depth: 10-> 8, thread#: 1, task-time/thread: 18.06%
Time total: 7336.052 ms, time task: 1280.036 ms ( 17.45%), spread->depth: 100-> 4, thread#: 1, task-time/thread: 17.45%
Conclusion: The number of forks has only a minor effect (still, fewer forks = better); the implementation seems to be fairly sophisticated. Similar results were collected with other settings, so I skip those here.
Fib(0) (almost all time spent on forking)
Time total: 7866.777 ms, time task: 1421.488 ms ( 18.07%), spread->depth: 10-> 8, thread#: 1, task-time/thread: 18.07%
Time total: 7085.142 ms, time task: 1349.207 ms ( 19.04%), spread->depth: 10-> 8, thread#: 2, task-time/thread: 9.52%
Time total: 6580.609 ms, time task: 1476.467 ms ( 22.44%), spread->depth: 10-> 8, thread#: 4, task-time/thread: 5.61%
Conclusion: With a very minor task, most time is spent on forking, making a single-threaded implementation about 5 times faster than any Fork-Join setup. Even with multiple threads it is impossible to gain any performance increase using Fork-Join.
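For reference, the single-threaded baseline this is being compared against could be as simple as running the leaf loop directly, once per would-be leaf task, with no forking at all. The leaf count below is assumed to match SPREAD^depth = 10^8 for the 10->8 setting, and CALCULATION_N = 0 corresponds to the Fib(0) case:

```java
public class LinearBaseline {
    static final long LEAF_TASKS = 100_000_000L; // assumed: SPREAD^depth for the 10->8 setting
    static final int CALCULATION_N = 0;          // Fib(0): no loop work per leaf

    public static void main(String[] args) {
        final long startTime = System.nanoTime();
        long result = 0;
        for (long leaf = 0; leaf < LEAF_TASKS; ++leaf) {
            int a = 0, b = 1, h;
            for (int n = 0; n < CALCULATION_N; ++n) {
                h = b;
                b = a + b;
                a = h;
            }
            result += b; // same per-leaf result as the forked version, but no fork overhead
        }
        final double ms = (System.nanoTime() - startTime) / 1_000_000.0;
        System.out.println("result=" + result + " in " + ms + " ms");
    }
}
```

Everything the Fork-Join version spends on task creation, scheduling and joining disappears here, which is where the roughly 5x difference comes from.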
Fib(100)
Time total: 12487.634 ms, time task: 5707.720 ms ( 45.71%), spread->depth: 10-> 8, thread#: 1, task-time/thread: 45.71%
Time total: 8386.855 ms, time task: 5768.881 ms ( 68.78%), spread->depth: 10-> 8, thread#: 2, task-time/thread: 34.39%
Time total: 7078.769 ms, time task: 6086.997 ms ( 85.99%), spread->depth: 10-> 8, thread#: 4, task-time/thread: 21.50%
Conclusion: This seems to be close to the break-even point for single-threaded execution, while multi-threading begins to have an impact. Still, a single-threaded implementation would be faster than any Fork-Join setup.
Fib(1000)
Time total: 5941.344 ms, time task: 5228.258 ms ( 88.00%), spread->depth: 10-> 7, thread#: 1, task-time/thread: 88.00%
Time total: 3160.818 ms, time task: 5244.241 ms (165.91%), spread->depth: 10-> 7, thread#: 2, task-time/thread: 82.96%
Time total: 16301.697 ms, time task: 53351.694 ms (327.28%), spread->depth: 10-> 8, thread#: 4, task-time/thread: 81.82%
Conclusion: Times start to stabilize here for multi-threaded execution with a near-linear gain, while ~20% of the calculation time per thread is still spent on forking. While at this point forking can increase performance through threading, a naive implementation would still be noticeably faster.
Fib(10000)
Time total: 5204.786 ms, time task: 5119.133 ms ( 98.35%), spread->depth: 10-> 6, thread#: 1, task-time/thread: 98.35%
Time total: 26033.889 ms, time task: 51084.118 ms (196.22%), spread->depth: 10-> 7, thread#: 2, task-time/thread: 98.11%
Time total: 13183.573 ms, time task: 51637.471 ms (391.68%), spread->depth: 10-> 7, thread#: 4, task-time/thread: 97.92%
Conclusion: At this number the calculation outweighs the cost of forking. While a naive implementation would still be slightly faster, the loss caused by forking is negligible if the task would have been much more difficult to implement in another way.
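A common way to stay above this break-even point is a sequential cutoff: only fork while the remaining work exceeds a threshold, and compute smaller chunks inline. This is a generic sketch (class names and the summing workload are made up; the cutoff of 10000 is taken from the measurements above), not the benchmark code itself:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ThresholdDemo {
    // Assumed cutoff: keep each leaf above the ~10000-iteration break-even measured above.
    static final int THRESHOLD = 10_000;

    static class SummingTask extends RecursiveTask<Long> {
        final int from, to;

        SummingTask(int from, int to) {
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                long sum = 0;
                for (int i = from; i < to; ++i) {
                    sum += i; // do the work inline, no further forking
                }
                return sum;
            }
            final int mid = (from + to) >>> 1;
            final SummingTask left = new SummingTask(from, mid);
            final SummingTask right = new SummingTask(mid, to);
            left.fork();                          // left half runs asynchronously
            return right.compute() + left.join(); // right half runs in this thread
        }
    }

    public static void main(String[] args) {
        final long sum = new ForkJoinPool(4).invoke(new SummingTask(0, 1_000_000));
        System.out.println(sum); // prints 499999500000
    }
}
```

With the cutoff, the number of forks grows with the work size divided by the threshold rather than with the leaf count, so the forking overhead stays bounded.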
Code
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;
import java.util.concurrent.RecursiveTask;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class Test {
    static final int NUM_THREADS = 4;
    static final int SPREAD = 10;
    static final int LOOPS = 4000000;
    static final int CALCULATION_N = 10000;
    static final boolean DO_ASYNC = true;
    //---
    // try to have the execution take about the same time regardless of spread
    static final long MAX_DEPTH = Math.round(Math.log(LOOPS) / Math.log(SPREAD));

    private static class Task extends RecursiveTask<Integer> {
        final static AtomicLong timeExecute = new AtomicLong(0);
        final static AtomicLong totalLoops = new AtomicLong(0);
        final long depth;

        public Task(final long depth) {
            this.depth = depth;
        }

        @Override
        protected Integer compute() {
            if (depth < MAX_DEPTH) {
                // inner node: fork SPREAD sub-tasks and sum their results
                final Task[] subTasks = new Task[SPREAD];
                for (int i = 0; i < subTasks.length; ++i) {
                    subTasks[i] = new Task(depth + 1);
                }
                try {
                    invokeAll(subTasks);
                    final long startTime = System.nanoTime();
                    int result = 0;
                    for (final Task task : subTasks) {
                        if (task.isCompletedNormally()) {
                            result += task.get();
                        }
                    }
                    timeExecute.addAndGet(System.nanoTime() - startTime);
                    return result;
                } catch (Exception e) {
                    this.completeExceptionally(e);
                    return null;
                }
            } else {
                // leaf: the actual calculation, a Fibonacci loop of CALCULATION_N iterations
                totalLoops.incrementAndGet();
                final long startTime = System.nanoTime();
                int a = 0, b = 1, h;
                for (int n = 0; n < CALCULATION_N; ++n) {
                    h = b;
                    b = a + b;
                    a = h;
                }
                timeExecute.addAndGet(System.nanoTime() - startTime);
                return b;
            }
        }
    }

    public static void main(String[] args) {
        final AtomicInteger threadCount = new AtomicInteger(0);
        final ForkJoinPool pool = new ForkJoinPool(NUM_THREADS, new ForkJoinPool.ForkJoinWorkerThreadFactory() {
            @Override
            public ForkJoinWorkerThread newThread(ForkJoinPool pool) {
                threadCount.getAndIncrement();
                final ForkJoinWorkerThread result = ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool);
                result.setPriority(Thread.MIN_PRIORITY);
                return result;
            }
        }, null, DO_ASYNC);
        final long startTime = System.nanoTime();
        final Integer result = pool.invoke(new Task(0));
        final double duration = ((double) (System.nanoTime() - startTime)) / 1000000.0;
        final double executionDuration = ((double) Task.timeExecute.get()) / 1000000.0;
        final double executionPercent = executionDuration / duration * 100.0;
        final double executionPercentPerThread = executionPercent / ((double) NUM_THREADS);
        System.out.println("Completed: " + result + " in " + Task.totalLoops.get() + " loops.");
        System.out.println(String.format(
                "Time total: %8.3f ms, time task: %8.3f ms (%6.2f%%), spread->depth: %2d->%2d, thread#: %1d, task-time/thread: %5.2f%%",
                duration, executionDuration, executionPercent, SPREAD, MAX_DEPTH, threadCount.get(), executionPercentPerThread));
    }
}
Feel free to point out any mistakes or to make suggestions for improvement. I will accept the most valuable answer for some bonus points.