This graph in "Parallel and Concurrent Programming": http://chimera.labs.oreilly.com/books/1230000000929/ch03.html#fig_kmeans-granularity at first seems to indicate that there's a serious overhead to sparking too much. But if you look carefully at the y-axis, you'll notice that it's been zoomed-in to the interesting section. In fact, the ratio between the best and worst case-performance shown is about 80% which isn't too bad.
In general, figuring out how and how much to chunk is difficult, error-prone, extremely application-specific, and likely to change next year when you buy a new computer with more processing power. I would much prefer to just always use rpar with the most fine-grained items possible and live with the 25% overhead.
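To make the contrast concrete, here's roughly what I mean, sketched with parList and parListChunk from Control.Parallel.Strategies (the function expensive and the chunk size 100 are placeholders of my own, not anything from the book):

import Control.Parallel.Strategies (parList, parListChunk, rseq, using)

-- Stand-in for real per-element work.
expensive :: Int -> Int
expensive n = n * n

-- Fine-grained: one spark per list element.
sumFine :: [Int] -> Int
sumFine xs = sum (map expensive xs `using` parList rseq)

-- Chunked: one spark per 100-element chunk; 100 is exactly the kind of
-- machine-dependent guess I'd rather not have to tune.
sumChunked :: [Int] -> Int
sumChunked xs = sum (map expensive xs `using` parListChunk 100 rseq)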
Can the overhead of sparking generally be much worse than what this graph shows? (Especially if I always fold over binary trees rather than lists, so the second bullet point about "the amount of sequential work" doesn't apply.)
Updated question in response to Don Stewart's answer:
Does the spark pool consist of a single queue that all the processors contend for, or are there many?
For instance, suppose I have a computer with infinitely many processors and a binary tree, and I want to take the sum over all the leaves, as in:
import Control.Parallel (par)

data Node = Leaf Int | Branch Node Node

sumL :: Node -> Int
sumL (Leaf x) = x
sumL (Branch n1 n2) = let (x, y) = (sumL n1, sumL n2) in (x `par` y) `seq` (x + y)  -- spark x, force y, then add
Will this program run in O(#leaves) time, or in O(depth) time? Is there a better way to write this?
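For reference, I know the Control.Parallel documentation recommends pseq over seq here, because seq makes no guarantee about evaluation order; a variant of the branch case along those lines (reusing the Node type above) would be:

import Control.Parallel (par, pseq)

-- pseq pins the order: spark the left sum, evaluate the right sum on this
-- thread, then add.
sumL' :: Node -> Int
sumL' (Leaf x) = x
sumL' (Branch n1 n2) = let (x, y) = (sumL' n1, sumL' n2) in x `par` (y `pseq` (x + y))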
Please let me know if I'm abstracting too much to get a satisfactory answer. My mental model of how Haskell parallelism works is still very fuzzy.
A single spark is very cheap. From [1]:

par a b

adds the thunk a to the (current HEC’s) Spark Pool; this thunk is called a “spark”. If any HEC becomes idle, it can then check the pool and start evaluating the thunk on the top.
So sparking is roughly adding a pointer to a queue. Also from [1]:

To make spark distribution cheaper and more asynchronous we re-implemented each HEC’s Spark Pool as a bounded workstealing queue (Arora et al. 1998; Chase and Lev 2005). A workstealing queue is a lock-free data structure with some attractive properties: the owner of the queue can push and pop from one end without synchronisation, meanwhile other threads can “steal” from the other end of the queue incurring only a single atomic instruction.
The problem is that you can easily create billions of sparks. At that point, you're simply turning your program into a queue builder -- all the time is spent updating the spark pool with pointers to code.
Good advice is to profile, determine how many sparks are actually turned into work, and use that to guide thresholds for when to stop sparking. For example:
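Concretely: compile with -threaded and -rtsopts, then run with +RTS -s and look at the SPARKS line the threaded RTS prints. (The program name and the numbers below are invented; the line format is what GHC actually emits.)

$ ghc -O2 -threaded -rtsopts SumL.hs
$ ./SumL +RTS -N8 -s
...
SPARKS: 100000 (1500 converted, 0 overflowed, 0 dud, 80000 GC'd, 18500 fizzled)

Only the converted sparks became parallel work; the GC'd and fizzled ones were pure queue-management overhead. If only a tiny fraction converts, you are sparking too much.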