Analysis: Performance of ForkJoinPool

Question

As Fork-Join seems to be the current hype and recommended in many answers, I thought: why not do some research on how fast it actually is?

To measure this, I wrote a small program (see code below) that does some adding of numbers and forked it out with various parameters, including number of threads, fork-depth and fork-spread, then measured the time for execution and especially the time spent for actual calculation vs. time spent on forking.

Abstract Answer

While well implemented, ForkJoin is an extremely inefficient way to parallelize a task, as the cost of each fork is very high. A naive, problem-optimized implementation can easily achieve 99% thread-execution time (which beats everything measured with Fork-Join), and therefore such an implementation is always faster than a Fork-Join implementation. In addition, if the actual task per fork is minor, a Fork-Join implementation can be much slower than even a single-threaded linear implementation.

So Fork-Join is more a question of whether or not it helps the architecture of your code, as it doesn't have any performance benefits over other implementations. Therefore Fork-Join should only be used if:

  • Performance isn't critical and tasks frequently need to wait for the result of other tasks to continue. So basically if the Fork-Join structure greatly simplifies the task over a naive implementation.

  • The actual task greatly outweighs the cost of forking, so the loss becomes negligible. In my test a loop that added 2 values had to loop at least 10000 times per fork to get reasonable performance.

Edit: See here for a more in-depth analysis that I was pointed to.
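
For reference, here is a minimal sketch of the kind of naive, problem-optimized implementation I compare against (the class and constant names are made up and not part of the test code below): the leaf work is simply split evenly across plain threads, so almost all thread time goes into the actual loop.

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not part of the measured code: the leaf work is split
// statically across plain threads, so almost all thread time is the inner loop.
public class NaiveTest {

  static final int NUM_THREADS = 4;
  static final int LOOPS = 4000000;       // roughly the number of leaf tasks in the Fork-Join test
  static final int CALCULATION_N = 10000;

  public static void main(String[] args) throws InterruptedException {
    final AtomicLong resultSink = new AtomicLong(0); // keeps the result so the loop cannot be optimized away
    final Thread[] threads = new Thread[NUM_THREADS];
    final long startTime = System.nanoTime();
    for (int t = 0; t < NUM_THREADS; ++t) {
      threads[t] = new Thread(new Runnable() {
        @Override
        public void run() {
          long sum = 0;
          for (int loop = 0; loop < LOOPS / NUM_THREADS; ++loop) {
            int a = 0, b = 1, h;
            for (int n = 0; n < CALCULATION_N; ++n) { // same inner loop as the Fork-Join leaf task
              h = b;
              b = a + b;
              a = h;
            }
            sum += b;
          }
          resultSink.addAndGet(sum);
        }
      });
      threads[t].start();
    }
    for (final Thread thread : threads) {
      thread.join();
    }
    final double duration = (System.nanoTime() - startTime) / 1000000.0;
    System.out.println(String.format("Time total: %8.3f ms (result: %d)", duration, resultSink.get()));
  }
}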

Test Setup

In my program I had a RecursiveTask calculate a Fibonacci series for a given N, which reduces the actual calculation to 3 assignments and 1 addition. For any given CPU this should be a minor task.

Over the tests I varied the number of threads, the number of forks per task and the length of the Fibonacci loop. In addition I did some tests with the async parameter, but setting it to false only showed a minor reduction in calculation time, so I skipped that. The spread parameter (forks per fork) was likewise mostly skipped, as there was no significant difference in the results.

In general the calculation time is very stable; the actual percentage of time spent on the task usually varies by less than 1%. Each test set was therefore run about 5 times (or more if the numbers were unstable) on an otherwise idle system with 4 cores (+4 hyper-threading cores), and the median execution time was selected.

Proper execution was verified through various test variables; in particular, the number of actual threads used was verified to never differ from the initially given parallelism parameter.

Detailed Test Results

Where:

  • Time total is the total time the entire calculation took from the perspective of the main thread.
  • Time task is the time spent on actually calculating the Fibonacci series in all forks combined.
  • Time task percentage is the relative gain thanks to threading (time task / time total).
  • spread->depth is the (set) spread (fork per fork) and the (calculated) forking-depth.
  • threads is the actual amount of threads used.
  • task-time/thread is the time each thread actually spent on calculating the Fibonacci series over-all.

Spread->depth test:

Time total: 8766.670 ms, time task: 1717.418 ms ( 19.59%), spread->depth:  2->26, thread#: 1, task-time/thread: 19.59%
Time total: 7872.244 ms, time task: 1421.478 ms ( 18.06%), spread->depth: 10-> 8, thread#: 1, task-time/thread: 18.06%
Time total: 7336.052 ms, time task: 1280.036 ms ( 17.45%), spread->depth: 100-> 4, thread#: 1, task-time/thread: 17.45%

Conclusion: The number of forks has only a minor effect (still, fewer forks = better); the implementation seems to be fairly sophisticated. Similar results were collected with other settings, so I skip them here.

Fib(0) (almost all time spent on forking)

Time total: 7866.777 ms, time task: 1421.488 ms ( 18.07%), spread->depth: 10-> 8, thread#: 1, task-time/thread: 18.07%
Time total: 7085.142 ms, time task: 1349.207 ms ( 19.04%), spread->depth: 10-> 8, thread#: 2, task-time/thread:  9.52%
Time total: 6580.609 ms, time task: 1476.467 ms ( 22.44%), spread->depth: 10-> 8, thread#: 4, task-time/thread:  5.61%

Conclusion: With a very minor task, most time is spent on forking, making a single-threaded implementation about 5 times faster than any Fork-Join setup. Even with multiple threads it is impossible to gain any performance increase using Fork-Join.

Fib(100)

Time total: 12487.634 ms, time task: 5707.720 ms ( 45.71%), spread->depth: 10-> 8, thread#: 1, task-time/thread: 45.71%
Time total:  8386.855 ms, time task: 5768.881 ms ( 68.78%), spread->depth: 10-> 8, thread#: 2, task-time/thread: 34.39%
Time total:  7078.769 ms, time task: 6086.997 ms ( 85.99%), spread->depth: 10-> 8, thread#: 4, task-time/thread: 21.50%

Conclusion: This seems to be close to the break-even point for single-threaded execution, while multi-threading begins to have an impact. Still, a single-threaded implementation would be faster than any Fork-Join setup.

Fib(1000)

Time total:  5941.344 ms, time task:  5228.258 ms ( 88.00%), spread->depth: 10-> 7, thread#: 1, task-time/thread: 88.00%
Time total:  3160.818 ms, time task:  5244.241 ms (165.91%), spread->depth: 10-> 7, thread#: 2, task-time/thread: 82.96%
Time total: 16301.697 ms, time task: 53351.694 ms (327.28%), spread->depth: 10-> 8, thread#: 4, task-time/thread: 81.82%

Conclusion: Times start to stabilize here for multi-threaded execution with a near-linear gain, while still ~20% of the calculation time per thread is spent on forking. While at this point forking can increase performance through threading, a naive implementation would still be noticeably faster.

Fib(10000)

Time total:  5204.786 ms, time task:  5119.133 ms ( 98.35%), spread->depth: 10-> 6, thread#: 1, task-time/thread: 98.35%
Time total: 26033.889 ms, time task: 51084.118 ms (196.22%), spread->depth: 10-> 7, thread#: 2, task-time/thread: 98.11%
Time total: 13183.573 ms, time task: 51637.471 ms (391.68%), spread->depth: 10-> 7, thread#: 4, task-time/thread: 97.92%

Conclusion: At this size the calculation outweighs the cost of forking. While a naive implementation would still be slightly faster, the loss caused by forking is negligible if the task would have been much more difficult to implement any other way.

Code

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinWorkerThread;
import java.util.concurrent.RecursiveTask;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class Test {

  static final int NUM_THREADS = 4;
  static final int SPREAD = 10;
  static final int LOOPS = 4000000;
  static final int CALCULATION_N = 10000;
  static final boolean DO_ASYNC = true;
  //---
  static final long MAX_DEPTH = Math.round(Math.log(LOOPS) / Math.log(SPREAD)); // try to have the execution take about the same time

  private static class Task extends RecursiveTask<Integer> {

    final static AtomicLong timeExecute = new AtomicLong(0);
    final static AtomicLong totalLoops = new AtomicLong(0);
    final long depth;

    public Task(final long depth) {
      this.depth = depth;
    }

    @Override
    protected Integer compute() {
      if (depth < MAX_DEPTH) { // not yet at maximum depth: fork SPREAD sub-tasks
        final Task[] subTasks = new Task[SPREAD];
        for (int i = 0; i < subTasks.length; ++i) {
          subTasks[i] = new Task(depth + 1);
        }
        try {
          invokeAll(subTasks); // fork all sub-tasks and wait for them to complete
          final long startTime = System.nanoTime();
          int result = 0;
          for (final Task task : subTasks) {
            if (task.isCompletedNormally()) {
              result += task.get();
            }
          }
          timeExecute.addAndGet(System.nanoTime() - startTime);
          return result;
        } catch (Exception e) {
          this.completeExceptionally(e);
          return null;
        }
      } else {
        // leaf task: run the timed Fibonacci loop
        totalLoops.incrementAndGet();
        final long startTime = System.nanoTime();
        int a = 0, b = 1, h;
        for (int n = 0; n < CALCULATION_N; ++n) {
          h = b;
          b = a + b;
          a = h;
        }
        timeExecute.addAndGet(System.nanoTime() - startTime);
        return b;
      }
    }
  }

  public static void main(String[] args) {
    final AtomicInteger threadCount = new AtomicInteger(0);
    final ForkJoinPool pool = new ForkJoinPool(NUM_THREADS, new ForkJoinPool.ForkJoinWorkerThreadFactory() {
      @Override
      public ForkJoinWorkerThread newThread(ForkJoinPool pool) {
        threadCount.getAndIncrement();
        final ForkJoinWorkerThread result = ForkJoinPool.defaultForkJoinWorkerThreadFactory.newThread(pool);
        result.setPriority(Thread.MIN_PRIORITY);
        return result;
      }
    }, null, DO_ASYNC);
    final long startTime = System.nanoTime();
    final Integer result = pool.invoke(new Task(0));
    final double duration = ((double) (System.nanoTime() - startTime)) / 1000000.0;
    final double executionDuration = ((double) Task.timeExecute.get()) / 1000000.0;
    final double executionPercent = executionDuration / duration * 100.0;
    final double executionPercentPerThread = executionPercent / ((double) NUM_THREADS);

    System.out.println("Completed: " + result + " in " + Task.totalLoops.get() + " loops.");
    System.out.println(String.format("Time total: %8.3f ms, time task: %8.3f ms (%6.2f%%), spread->depth: %2d->%2d, thread#: %1d, task-time/thread: %5.2f%%", duration, executionDuration, executionPercent, SPREAD, MAX_DEPTH, threadCount.get(), executionPercentPerThread));
  }
}

Feel free to point out any mistakes or to make suggestions for improvement. I will accept the most valuable answer for some bonus points.

Lingulate answered 29/11, 2013 at 14:52 Comment(4)
-1. Seems like you've already concluded that the FJ framework is "extremely inefficient" and are just looking for discussion/argument.Frizzle
Yes that is the result of my analysis. And yes I am looking for discussion and/or arguments that either support my result or prove me wrong. That's called a scientific approach. Not sure why you give a -1 on that. OoLingulate
This isn't a discussion forum. See stackoverflow.com/help/dont-ask .Frizzle
This question appears to be off-topic because it is more of a discussion than an actual question.Gotland

Suggestions:

  • Print the number of forks made + an estimate of the cost of the work that is done (i.e. the number of additions, or the length of the BigIntegers being summed if you switch to them). This proportion will show how effective your forking strategy is and give you an understanding of what minimal job size makes sense.
  • Check your algorithm - Fibonacci numbers grow exponentially, and your task returns an int, so you should be getting an overflow very soon.

So, the goal is to choose a threshold which would say to fork or not to fork:

One of the main things to consider when implementing an algorithm using fork/join parallelism is choosing the threshold which determines whether a task will execute a sequential computation rather than forking parallel sub-tasks.

If the threshold is too large, then the program might not create enough tasks to fully take advantage of the available processors/cores.

If the threshold is too small, then the overhead of task creation and management could become significant.

In general, some experimentation will be necessary to find an appropriate threshold value. Source
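
As an illustration of that threshold, a minimal sketch (SumTask and THRESHOLD are made-up names, not from the question): below the threshold the range is summed sequentially, above it the task forks.

import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {

  static final int THRESHOLD = 10000; // would be found by experimentation, as the quote says

  final long[] values;
  final int from;
  final int to;

  SumTask(long[] values, int from, int to) {
    this.values = values;
    this.from = from;
    this.to = to;
  }

  @Override
  protected Long compute() {
    if (to - from <= THRESHOLD) {
      long sum = 0;                    // below the threshold: plain sequential loop
      for (int i = from; i < to; ++i) {
        sum += values[i];
      }
      return sum;
    }
    final int mid = (from + to) >>> 1; // above the threshold: split the range in half
    final SumTask left = new SumTask(values, from, mid);
    left.fork();                       // run the left half asynchronously
    final long right = new SumTask(values, mid, to).compute(); // compute the right half in this thread
    return right + left.join();
  }
}

It would then be invoked with something like new ForkJoinPool().invoke(new SumTask(data, 0, data.length)).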

This also could be useful: How to determine the proper work division threshold of a fork-join task.

Derby answered 29/11, 2013 at 15:27 Comment(1)
There is no threshold where forking is faster; it is more an architectural question whether or not to use it. Any naive implementation is always faster. The result of the Fibonacci is irrelevant. It will eventually overflow, but that doesn't cause any issues in Java, as the overflow silently wraps around.Lingulate

I have not tried your test yet, but for any divide/conquer or queueing approach you must weigh in the cost of splitting work, queue and job handling, and aggregating job results. So there will never be 100% efficiency in terms of total CPU cycles compared to single-threaded versions. I have another Fibonacci-based test myself where I experiment with setting a recursion limit so that fib(limit) is calculated recursively in the same thread, without generating new jobs for the next recursion level. The time spent at this recursion level is then the time taken by each ForkJoinTask. I measure that time before the actual benchmark to find the sweet spot for how long a task should be for the optimum balance between minimum overhead and maximum core utilization. For the hardware I tested on, that was around 10µs for single-socket x86 up to 1ms for 4-way machines.
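
A minimal sketch of that cutoff, with made-up names (FibTask, SEQUENTIAL_LIMIT): below the limit, fib(n) is computed recursively in the calling thread instead of spawning new tasks.

import java.util.concurrent.RecursiveTask;

class FibTask extends RecursiveTask<Long> {

  // would be tuned so that a leaf takes roughly 10µs-1ms on the target hardware
  static final int SEQUENTIAL_LIMIT = 20;

  final int n;

  FibTask(int n) {
    this.n = n;
  }

  @Override
  protected Long compute() {
    if (n <= SEQUENTIAL_LIMIT) {
      return fibSequential(n);         // stay in the current thread, no new jobs
    }
    final FibTask f1 = new FibTask(n - 1);
    f1.fork();                         // only the upper recursion levels create tasks
    final FibTask f2 = new FibTask(n - 2);
    return f2.compute() + f1.join();
  }

  private static long fibSequential(int n) {
    return n <= 1 ? n : fibSequential(n - 1) + fibSequential(n - 2);
  }
}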

Seychelles answered 29/11, 2013 at 17:23 Comment(4)
It definitely would make sense to add something that measures the rate while running, as in the worst case a Fork-Join implementation can even be much slower than any single-threaded implementation. But then this greatly increases the code complexity, something this construct is actually supposed to reduce.Lingulate
As soon as you have found that sweet spot for your target platform, you don't need to experiment further. You just have to know how long your tasks are going to take under normal circumstances, so that you know when to stop splitting and start getting to work. But anyway, reducing code size by going from single-threaded to fork-join would be unlikely.Seychelles
The big problem is that the sweet spot on your machine could be something entirely different than the sweet spot on a different machine. I have already faced situations where my code ran faster on my desktop than on an extremely powerful server machine. In the end it is better to use an architecture where you do not have to search for this spot.Lingulate
This is why I gave these estimates. We are talking about the single-threaded job that is eventually run by the fork-join executor (or any kind of executor, for that matter). This will likely be faster on a well-specced desktop than on a potentially virtualized, overloaded server machine that is only fast because of multiprocessing, that is, after you have parallelized everything. If the job size stays above 1ms (or possibly more on an overloaded machine), the executor overhead becomes negligible.Seychelles

Your "measurements" have a large observer effect...

You might want to replace your AtomicLongs with LongAdders to reduce the impact of your measurements... and think about reducing them even more...
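
A minimal sketch of that swap (the class and method names here are only illustrative): LongAdder.add() plus sum() takes the place of AtomicLong.addAndGet()/get(). LongAdder requires Java 8+.

import java.util.concurrent.atomic.LongAdder;

public class LeafTiming {

  // LongAdder spreads concurrent updates over several internal cells,
  // so the worker threads contend less than with a single AtomicLong.
  static final LongAdder timeExecute = new LongAdder();

  static void recordLeafTime(long nanos) {
    timeExecute.add(nanos); // instead of timeExecute.addAndGet(nanos)
  }

  public static void main(String[] args) {
    recordLeafTime(1234);
    System.out.println("total ns: " + timeExecute.sum()); // sum() instead of get()
  }
}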

Use a framework like JMH to mitigate benchmarking gotchas...
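
For example, the timed leaf loop from the question as a minimal JMH benchmark (requires the JMH dependencies on the classpath; the class name is made up):

import org.openjdk.jmh.annotations.Benchmark;

public class FibLoopBenchmark {

  @Benchmark
  public int fibLoop() {
    // same leaf loop as in the question; returning b keeps the JIT
    // from treating the loop as dead code
    int a = 0, b = 1, h;
    for (int n = 0; n < 10000; ++n) {
      h = b;
      b = a + b;
      a = h;
    }
    return b;
  }
}

JMH then handles warm-up iterations, forked JVMs and statistics, which the hand-rolled timing in the question does not.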

Your measurements are not something that anyone can use to make any non-naive conclusion...

FJP is a very good thread pool implementation; it is the best option you have in the JDK to take advantage of your CPU cores.

In my benchmarks (using JMH) comparing FJP with the "Legacy" JDK executors:

https://github.com/zolyfarkas/spf4j/blob/master/spf4j-benchmarks/src/test/java/org/spf4j/concurrent/ThreadPoolBenchmarkFjp.java

and

https://github.com/zolyfarkas/spf4j/blob/master/spf4j-benchmarks/src/test/java/org/spf4j/concurrent/ThreadPoolBenchmarkStdJdk.java

Running on JDK 1.7, FJP is about 2 times faster:

Benchmark                                   Mode  Cnt     Score     Error  Units
ThreadPoolBenchmarkFjp.fjpBenchmark        thrpt   10  6873.926 ± 334.733  ops/s
ThreadPoolBenchmarkStdJdk.stdJdkBenchmark  thrpt   10  3210.170 ± 170.883  ops/s

With JDK 1.8, FJP is about 3 times faster:

Benchmark                                   Mode  Cnt     Score      Error  Units
ThreadPoolBenchmarkFjp.fjpBenchmark        thrpt   10  9679.502 ± 1160.887  ops/s
ThreadPoolBenchmarkStdJdk.stdJdkBenchmark  thrpt   10  3466.997 ±   81.594  ops/s
Molloy answered 12/9, 2015 at 22:39 Comment(0)
