javamultithreadingfork-joinforkjoinpoolwork-stealing

Java ForkJoinPool with non-recursive tasks, does work-stealing work?


I want to submit Runnable tasks into ForkJoinPool via a method:

forkJoinPool.submit(Runnable task)

Note, I use JDK 7.

Under the hood, they are transformed into ForkJoinTask objects. I know that ForkJoinPool is efficient when a task is split into smaller ones recursively.

Question:

Does work-stealing still work in the ForkJoinPool if there is no recursion?

Is it worth it in this case?

Update 1: Tasks are small and can be unbalanced. Even for strictly equal tasks, such things like context switching, thread scheduling, parking, pages misses etc. get in the way leading to the imbalance.

Update 2: Doug Lea wrote in the Concurrency JSR-166 Interest group, by giving a hint on this:

This also greatly improves throughput when all tasks are async and submitted to the pool rather than forked, which becomes a reasonable way to structure actor frameworks, as well as many plain services that you might otherwise use ThreadPoolExecutor for.

I presume, when it comes to reasonably small CPU-bound tasks, ForkJoinPool is the way to go, thanks to this optimization. The main point is that these tasks are already small and needn't a recursive decomposition. Work-stealing works, regardless whether it is a big or small task - tasks can be grabbed by another free worker from the Deque's tail of a busy worker.

Update 3: Scalability of ForkJoinPool - benchmarking by Akka team of ping-pong shows great results.

Despite this, to apply ForkJoinPool more efficiently requires performance tuning.


Solution

  • ForkJoinPool source code has a nice section called "Implementation Overview", read up for an ultimate truth. The explanation below is my understanding for JDK 8u40.

    Since day one, ForkJoinPool had a work queue per worker thread (let's call them "worker queues"). The forked tasks are pushed into the local worker queue, ready to be popped by the worker again and be executed -- in other words, it looks like a stack from worker thread perspective. When a worker depletes its worker queue, it goes around and tries to steal the tasks from other worker queues. That is "work stealing".

    Now, before (IIRC) JDK 7u12, ForkJoinPool had a single global submission queue. When worker threads ran out of local tasks, as well the tasks to steal, they got there and tried to see if external work is available. In this design, there is no advantage against a regular, say, ThreadPoolExecutor backed by ArrayBlockingQueue.

    It changed significantly after then. After this submission queue was identified as the serious performance bottleneck, Doug Lea et al. striped the submission queues as well. In hindsight, that is an obvious idea: you can reuse most of the mechanics available for worker queues. You can even loosely distribute these submission queues per worker threads. Now, the external submission goes into one of the submission queues. Then, workers that have no work to munch on, can first look into the submission queue associated with a particular worker, and then wander around looking into the submission queues of others. One can call that "work stealing" too.

    I have seen many workloads benefiting from this. This particular design advantage of ForkJoinPool even for plain non-recursive tasks was recognized a long ago. Many users at concurrency-interest@ asked for a simple work-stealing executor without all the ForkJoinPool arcanery. This is one of the reasons, why we have Executors.newWorkStealingPool() in JDK 8 onward -- currently delegating to ForkJoinPool, but open for providing a simpler implementation.