This graph in "Parallel and Concurrent Programming" (http://chimera.labs.oreilly.com/books/1230000000929/ch03.html#fig_kmeans-granularity) at first seems to indicate that there's a serious overhead to sparking too much. But if you look carefully at the y-axis, you'll notice that it has been zoomed in to the interesting region. In fact, the ratio between the best- and worst-case performance shown is about 80%, which isn't too bad.
In general, figuring out how and how much to chunk is difficult, error-prone, extremely application-specific, and likely to change next year when you buy a new computer with more processing power. I would much prefer to always use rpar on the most fine-grained items possible and live with the roughly 25% overhead that ratio implies.
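To make that concrete, this is the kind of maximally fine-grained sparking I have in mind (just a sketch; parSquares is a made-up example using parMap and rpar from Control.Parallel.Strategies):

import Control.Parallel.Strategies (parMap, rpar)

-- one spark per list element: the finest granularity possible, no chunking
parSquares :: [Int] -> [Int]
parSquares = parMap rpar (\x -> x * x)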
Can sparking overhead, in general, be much worse than what this graph shows? (Especially if I always fold over binary trees rather than lists, so that the second bullet point about "the amount of sequential work" doesn't apply.)
Updated question in response to Don Stewart's answer:
Does the spark pool consist of a single queue that all the processors contend for, or are there many?
For instance, suppose I have a machine with an unlimited number of processors and a binary tree, and I want to take the sum over all the leaves, as in:
import Control.Parallel (par)
data Node = Leaf Int | Branch Node Node
sumL :: Node -> Int
sumL (Leaf x) = x
-- spark the left subtree's sum, force the right one, then add
sumL (Branch n1 n2) = let (x, y) = (sumL n1, sumL n2) in (x `par` y) `seq` (x + y)
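(To be concrete, here is a minimal driver one could use to run this; buildTree and the depth of 20 are just placeholder choices, compiled with -threaded and run with +RTS -N:)

-- hypothetical driver: a complete tree of the given depth, with every leaf set to 1
buildTree :: Int -> Node
buildTree 0 = Leaf 1
buildTree d = Branch (buildTree (d - 1)) (buildTree (d - 1))

main :: IO ()
main = print (sumL (buildTree 20))
-- built with "ghc -O2 -threaded Sum.hs" and run with "./Sum +RTS -N"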
Will this program run in O(#leaves) time or in O(depth) time? Is there a better way to write it?
Please let me know if I'm abstracting too much to get a satisfactory answer. My mental model of how Haskell parallelism works is still very fuzzy.