Is it possible to speed up a quicksort with par in Haskell?

I have this seemingly trivial parallel quicksort implementation; the code is as follows:

import System.Random
import Control.Parallel
import Data.List

quicksort :: Ord a => [a] -> [a]
quicksort xs = pQuicksort 16 xs -- 16 is the number of sparks used to sort

-- pQuicksort, parallel quicksort.
-- As long as n > 0, evaluates the lower and upper parts of the list in parallel;
-- when we have recursed deep enough (n == 0), this turns into a serial quicksort.
pQuicksort :: Ord a => Int -> [a] -> [a]
pQuicksort _ [] = []
pQuicksort 0 (x:xs) =
  let (lower, upper) = partition (< x) xs
  in pQuicksort 0 lower ++ [x] ++ pQuicksort 0 upper
pQuicksort n (x:xs) =
  let (lower, upper) = partition (< x) xs
      l = pQuicksort (n `div` 2) lower
      u = [x] ++ pQuicksort (n `div` 2) upper
  in (par u l) ++ u

main :: IO ()
main = do
  gen <- getStdGen
  let randints = (take 5000000) $ randoms gen :: [Int]
  putStrLn . show . sum $ (quicksort randints)

I compile with

ghc --make -threaded -O2 quicksort.hs

and run with

./quicksort +RTS -N16 -RTS

No matter what I do, I cannot get this to run faster than a simple sequential implementation running on one CPU.

  1. Is it possible to explain why this runs so much slower on several CPUs than on one?
  2. Is it possible to make this scale, at least sub-linearly, with the number of CPUs by doing some trick?

EDIT: @tempestadept hinted that quicksort itself is the problem. To check this, I implemented a simple merge sort in the same spirit as the example above. It shows the same behaviour: the more capabilities you add, the slower it performs.

import System.Random
import Control.Parallel

splitList :: [a] -> ([a], [a])
splitList = helper True [] []
  where helper _ left right [] = (left, right)
        helper True  left right (x:xs) = helper False (x:left) right xs
        helper False left right (x:xs) = helper True  left (x:right) xs

merge :: (Ord a) => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys) = case x<y of
  True  -> x : merge xs (y:ys)
  False -> y : merge (x:xs) ys

mergeSort :: (Ord a) => [a] -> [a]
mergeSort xs = pMergeSort 16 xs -- we use 16 sparks

-- pMergeSort, parallel merge sort. Takes an extra argument
-- telling how many sparks to create. In our simple test it is
-- set to 16
pMergeSort :: (Ord a) => Int -> [a] -> [a]
pMergeSort _ [] = []
pMergeSort _ [a] = [a]
pMergeSort 0 xs =
  let (left, right) = splitList xs
  in  merge (pMergeSort 0 left) (pMergeSort 0 right)
pMergeSort n xs =
  let (left, right) = splitList xs
      l = pMergeSort (n `div` 2) left
      r = pMergeSort (n `div` 2) right
  in  (r `par` l) `pseq` (merge l r)

ris :: Int -> IO [Int]
ris n = do
  gen <- getStdGen
  return . (take n) $ randoms gen

main :: IO ()
main = do
  r <- ris 100000
  putStrLn . show . sum $ mergeSort r
Confessional answered 3/11, 2013 at 12:32 Comment(12)
Note that this is really an implementation of quicksort: #7718191Iranian
At least I can't get it to perform better with pseq, even when purging down any possible thunks with sum. Perhaps there's an entirely different problem involved. — As I have now deleted my answer, here it is again as a comment: 1. naming that function just quicksort might be confusing, since you wouldn't expect such a function to accept an extra parallelism argument; 2. use type signatures, always for top-level functions and even more so when they might work slightly differently from what the name suggests; 3. use library functions such as partition if possible. — Good question, BTW.Rabbet
I don't have enough time to post a full answer, but I guess there are two possible issues: (1) You should rather use l `par` u `pseq` (u ++ l). (2) While you run sub-computations in parallel, they're not really evaluated until needed. So you should force each sub-list to NF (or at least its full structure), something like forceList l `par` forceList u `pseq` (u ++ l) where forceList is (your own) function that forces evaluation of a list (a sketch of such a function appears after these comments). Also, for proper benchmarking I suggest using criterion.Henleyonthames
If you want a quick and easy way to see how your sparks are doing, you can compile with the -rtsopts flag and pass +RTS -sstderr when you then run the program.Amylolysis
@PetrPudlák: I tried all that, it doesn't seem to work here.Rabbet
@leftaroundabout: I have implemented your suggestions. The code should be easier to understand now. I am still intrigued by this code though. My previous uses of par have given perfect results, but not this one.Confessional
Actually, the mergesort implementation performs with almost constant speed on my machine as long as I don't use more threads than I have cores. I'm beginning to think the main problem we have is to do with memory/cache; lists just aren't great in that regard. If all cores wait most of the time fetching memory pages, little can be gained with parallelism. In the quicksort this is apparently more critical than in mergesort.Rabbet
I am running these examples on a 24-core cluster. Should it not perform better, cache-wise, when it uses 16 cores rather than one? It certainly has a lot more cache available that way, at least. Is the slowdown coming from the merge phase, when lists have to be moved from two cores to one?Confessional
@Confessional - More cache isn't much benefit if all those caches end up holding the same data. Maybe it makes sense to partition the data into blocks (around the size of one core's cache), then merge-sort the chunks with each merge-sort running sequentially on one core. That way, after that initial partitioning, each thread only needs its block in its cache for the mergesort. Performance in the partitioning may not be as important - maybe a lazy version, finding blocks sequentially, mostly waiting for the next core to be free to handle a block...Long
@Steve314: I think I am doing exactly what you say. I divide the problem in two until there are 16 parts, that is, I divide in two 4 times recursively. Then the algorithm sorts these 16 parts with a sequential merge sort.Confessional
@Confessional - I'm not sure I was clear. I meant quicksort-style partitioning, so that each mergesorted block is already in order relative to the other blocks, leaving only the problem of concatenation once the mergesorts are done - merging the sorted blocks. Lazy evaluation of partitioning should give blocks in order (lowest-to-highest or vice versa). It's an annoying startup delay waiting for the first few blocks to get your cores running, but it avoids the need to merge the blocks at the end (assuming concatenation can be done faster).Long
@Confessional - the idea is that mergesort locality is good only while sorting chunks up to the size of available cache. With linked lists, once the lists are larger than the cache, given all the re-ordering during the sort so far, locality will be terrible. But at the start of sorting (certainly in your test, maybe sometimes in reality too) the way the list is stored (or evaluated) may suit a linear scan, giving good locality for the partitioning. That leaves concatenation of chunks at the end, but there are tricks to deal with that in O(n) at least.Long
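
A minimal sketch of the forceList function @PetrPudlák suggests above (the name comes from that comment; seq forces each element to WHNF, which for a flat type like Int amounts to full evaluation):

forceList :: [a] -> ()
forceList []     = ()
forceList (x:xs) = x `seq` forceList xs

In the recursive case of pQuicksort this would be used as forceList u `par` (forceList l `pseq` (l ++ u)).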

There are a couple of problems that have already been mentioned:

  • Using lists is not going to give the performance you are looking for. Even this sample implementation using vector is a factor of ~50x faster than using lists, since it does in-place element swaps. For this reason, my answer will include an implementation using the array library massiv, rather than lists.
  • I tend to find the Haskell scheduler far from perfect for CPU-bound tasks, so, as @Edward Kmett noted in his answer, we need a work-stealing scheduler, which I conveniently implemented for the above-mentioned library: scheduler
-- A helper function that partitions a region of a mutable array.
unstablePartitionRegionM ::
     forall r e m. (Mutable r Ix1 e, PrimMonad m)
  => MArray (PrimState m) r Ix1 e
  -> (e -> Bool)
  -> Ix1 -- ^ Start index of the region
  -> Ix1 -- ^ End index of the region
  -> m Ix1
unstablePartitionRegionM marr f start end = fromLeft start (end + 1)
  where
    fromLeft i j
      | i == j = pure i
      | otherwise = do
        x <- A.unsafeRead marr i
        if f x
          then fromLeft (i + 1) j
          else fromRight i (j - 1)
    fromRight i j
      | i == j = pure i
      | otherwise = do
        x <- A.unsafeRead marr j
        if f x
          then do
            A.unsafeWrite marr j =<< A.unsafeRead marr i
            A.unsafeWrite marr i x
            fromLeft (i + 1) j
          else fromRight i (j - 1)
{-# INLINE unstablePartitionRegionM #-}

Here is the actual in-place quicksort:

quicksortMArray ::
     (Ord e, Mutable r Ix1 e, PrimMonad m)
  => Int
  -> (m () -> m ())
  -> A.MArray (PrimState m) r Ix1 e
  -> m ()
quicksortMArray numWorkers schedule marr =
  schedule $ qsort numWorkers 0 (unSz (msize marr) - 1)
  where
    qsort n !lo !hi =
      when (lo < hi) $ do
        p <- A.unsafeRead marr hi
        l <- unstablePartitionRegionM marr (< p) lo hi
        A.unsafeWrite marr hi =<< A.unsafeRead marr l
        A.unsafeWrite marr l p
        if n > 0
          then do
            let !n' = n - 1
            schedule $ qsort n' lo (l - 1)
            schedule $ qsort n' (l + 1) hi
          else do
            qsort n lo (l - 1)
            qsort n (l + 1) hi
{-# INLINE quicksortMArray #-}

Now, if we look at the arguments numWorkers and schedule, they are pretty opaque. If we supply 1 for the first argument and id for the second one, we simply get a sequential quicksort; but if we had a function available to us that could schedule each task to be computed concurrently, we would get a parallel implementation of a quicksort. Luckily for us, massiv provides exactly that out of the box with withMArray:

withMArray ::
     (Mutable r ix e, MonadUnliftIO m)
  => Array r ix e
  -> (Int -> (m () -> m ()) -> MArray RealWorld r ix e -> m a)
  -> m (Array r ix e)
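
As a side note, the sequential instantiation described above is simply a matter of supplying those two arguments (quicksortSeq is a name assumed here for illustration, not part of massiv):

quicksortSeq :: (Ord e, Mutable r Ix1 e, PrimMonad m) => A.MArray (PrimState m) r Ix1 e -> m ()
quicksortSeq = quicksortMArray 1 id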

Here is a pure version that makes a copy of the array and then sorts it in place, using the computation strategy specified within the array itself:

quicksortArray :: (Mutable r Ix1 e, Ord e) => Array r Ix1 e -> Array r Ix1 e
quicksortArray arr = unsafePerformIO $ withMArray arr quicksortMArray
{-# INLINE quicksortArray #-}
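
For illustration, a hypothetical usage sketch (the exact fromList/toList API varies between massiv versions; P is the primitive representation, and the Par strategy stored in the array is what makes quicksortArray run in parallel):

import Data.Massiv.Array as A

main :: IO ()
main = do
  -- the computation strategy (Par) travels with the array itself
  let arr = A.fromList Par [5, 3, 8, 1, 9, 2] :: Array P Ix1 Int
  print (A.toList (quicksortArray arr))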

Here comes the best part: the benchmarks. The order of the results:

  • Intro sort from vector-algorithms
  • In-place quicksort using vector from this answer
  • Implementation in C, which I grabbed from this question
  • Sequential quicksort using massiv
  • Same as above, but in parallel, on a computer with a humble 3rd-gen i7 quad-core processor with hyperthreading
benchmarking QuickSort/Vector Algorithms
time                 101.3 ms   (93.75 ms .. 107.8 ms)
                     0.991 R²   (0.974 R² .. 1.000 R²)
mean                 97.13 ms   (95.17 ms .. 100.2 ms)
std dev              4.127 ms   (2.465 ms .. 5.663 ms)

benchmarking QuickSort/Vector  
time                 89.51 ms   (87.69 ms .. 91.92 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 92.67 ms   (91.54 ms .. 94.50 ms)
std dev              2.438 ms   (1.468 ms .. 3.493 ms)

benchmarking QuickSort/C       
time                 88.14 ms   (86.71 ms .. 89.41 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 90.11 ms   (89.17 ms .. 93.35 ms)
std dev              2.744 ms   (387.1 μs .. 4.686 ms)

benchmarking QuickSort/Array   
time                 76.07 ms   (75.77 ms .. 76.41 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 76.08 ms   (75.75 ms .. 76.28 ms)
std dev              453.7 μs   (247.8 μs .. 699.6 μs)

benchmarking QuickSort/Array Par
time                 25.25 ms   (24.84 ms .. 25.61 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 25.13 ms   (24.80 ms .. 25.75 ms)
std dev              991.6 μs   (468.5 μs .. 1.782 ms)

The benchmarks sort 1,000,000 random Int64s. If you'd like to see the full code, you can find it here: https://github.com/lehins/haskell-quicksort

To sum it up, we got a 3x speedup on a quad-core processor with 8 capabilities, which sounds pretty good to me. Thanks for this question; now I can add a sorting function to massiv ;)

Edit

Note that the accepted answer, which uses lists instead of a more appropriate data structure for this problem such as a mutable array, is about 100x slower on the same input:

benchmarking List/random/List Par
time                 2.712 s    (2.566 s .. 3.050 s)
                     0.998 R²   (0.996 R² .. 1.000 R²)
mean                 2.696 s    (2.638 s .. 2.745 s)
std dev              59.09 ms   (40.83 ms .. 72.04 ms)
variance introduced by outliers: 19% (moderately inflated)
Averroes answered 27/4, 2019 at 21:44 Comment(3)
That's certainly some impressive performance! Too bad it's the kind of super-unidiomatic unsafe mutable code that's already well-dreaded from squeezing the most out of Data.Vector.Unboxed. Can scheduler also be used for this task with more higher-level code? (The purpose of my bounty was less to see a super-optimised solution than a simple solution that actually witnesses Haskell's supposed suitability for writing nice, purely functional parallelised code, not necessarily giving the best performance in absolute terms but giving reasonable performance that scales well and easily.)Rabbet
Yes, you can use the scheduler library outside of the array context. But keep in mind that, because it does scheduling using forkIO, it has to live in IO, so purely functional parallelised code will not be possible. Also, quicksort has to be done in place, so without mutation and at least the ST monad this question cannot be answered to your satisfaction ;)Averroes
Is partitioning in place really the best way to go? I'd think you could get more parallelism by dividing the array into segments, letting each thread partition one segment into its own buckets, and then merging the buckets in parallel.Hogue

I'm not sure how well it can work for the idiomatic quicksort, but it can work (to a somewhat weak extent) for the true imperative quicksort as shown by Roman in Sparking Imperatives.

He never did get good speedup, though. To optimize properly, this really needs a real work-stealing deque that doesn't overflow the way the spark queue does.

Spitzer answered 4/11, 2013 at 5:5 Comment(1)
Am I hitting a wall in the spark queue? I am only using par 16 times to subdivide the problem down to my 16 capabilities. After that the algorithm is sequential. Maybe I am not understanding something fundamental about the nature of par, pseq and sparks?Confessional

You won't get any noticeable improvement, since your pseudo-quicksort involves list concatenation, which cannot be parallelized and requires quadratic time (total time across all concatenations). I'd recommend you try working with a mergesort, which is O(n log n) on linked lists.

Also, to run the program with a large number of threads you should compile it with -rtsopts.
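
For example, the question's build and run lines would become (standard GHC flags; +RTS -sstderr additionally prints runtime and spark statistics, as suggested in the question's comments):

ghc --make -threaded -rtsopts -O2 quicksort.hs
./quicksort +RTS -N16 -sstderr -RTS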

Quaff answered 4/11, 2013 at 6:41 Comment(4)
I have added a merge sort implementation. I have taken care to make the algorithm traverse the lists only once during splitting and merging, which should be optimal. It shows the same symptoms as the quicksort, however: slower the more capabilities you throw at it. -rtsopts is only needed for more than 24 capabilities.Confessional
How is list concatenation quadratic in time?Rabbet
I mean, total time for all concatenations is quadraticQuaff
List concatenation can be parallelized. The trick is to have the main thread set up the concatenation, saving pointers to the boundaries between segments in the result. Those pointers can be passed to new threads along with appropriate sizes; each thread then forces its share of the result (using drop).Hogue

par only evaluates the first argument to weak head normal form. That is to say: if the first argument's type is Maybe Int, then par will check whether the result is Nothing or Just something and stop. It won't evaluate something at all. Likewise, for lists it only evaluates enough to check whether the list is [] or something : something_else. To evaluate the whole list in parallel, you don't pass the list directly to par; you make an expression that depends on the list in such a way that when you pass it to par, the entire list is needed. For example:

evalList :: [a] -> ()
evalList [] = ()
evalList (a:r) = a `pseq` evalList r

pMergeSort :: (Ord a) => Int -> [a] -> [a]
pMergeSort _ [] = []
pMergeSort _ [a] = [a]
pMergeSort 0 xs =
  let (left, right) = splitList xs
  in  merge (pMergeSort 0 left) (pMergeSort 0 right)
pMergeSort n xs =
  let (left, right) = splitList xs
      l = pMergeSort (n `div` 2) left
      r = pMergeSort (n `div` 2) right
  in  (evalList r `par` l) `pseq` (merge l r)

Another note: the overhead for launching new threads in Haskell is really low, so the case for pMergeSort 0 ... is probably not useful.
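
A minimal sketch of that simplification, reusing splitList, merge, and evalList from above and sparking at every level (pMergeSortAll is a name assumed here):

pMergeSortAll :: Ord a => [a] -> [a]
pMergeSortAll []  = []
pMergeSortAll [a] = [a]
pMergeSortAll xs =
  let (left, right) = splitList xs
      l = pMergeSortAll left
      r = pMergeSortAll right
  in  (evalList r `par` l) `pseq` merge l r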

Perla answered 8/7, 2015 at 3:42 Comment(1)
The case for 0 is still useful, both because there is still some overhead and because the runtime drops work when the spark queues overflow. See Kmett's sort-of answer.Hogue

I'm not sure this is worth noting, given @lehins' excellent answer, but...

Why your pQuickSort doesn't work

There are two big problems with your pQuickSort. The first is that you're using System.Random, which is bog slow and interacts strangely with a parallel sort (see below). The second is that your par u l sparks a computation to evaluate:

u = [x] ++ pQuicksort (n `div` 2) upper

to WHNF, namely u = x : UNEVALUATED_THUNK, so your sparks aren't doing any real work.
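
For reference, a sketch of what forcing both branches to normal form would look like, assuming the deepseq package (the NFData constraint is an addition to the original signature):

import Control.DeepSeq (NFData, force)
import Control.Parallel (par, pseq)
import Data.List (partition)

pQuicksortNF :: (NFData a, Ord a) => Int -> [a] -> [a]
pQuicksortNF _ [] = []
pQuicksortNF n (x:xs) =
  let (lower, upper) = partition (< x) xs
      l = pQuicksortNF (n `div` 2) lower
      u = x : pQuicksortNF (n `div` 2) upper
  in if n > 0
       then force u `par` (force l `pseq` (l ++ u))
       else l ++ u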

Observing an improvement with a simple pseudo-quicksort

In fact, it's not difficult to observe a performance improvement when parallelizing a naive, not-in-place, pseudo-quicksort. As mentioned, an important consideration is to avoid using System.Random. With a fast LCG, we can benchmark the actual sort time, rather than some weird mixture of sort and random number generation. The following pseudo-quicksort:

import Data.List

qsort :: Ord a => [a] -> [a]
qsort (x:xs)
  = let (a,b) = partition (<=x) xs
    in qsort a ++ x:qsort b
qsort [] = []

randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32

main :: IO ()
main = do
  let randints = randomList 5000000
  print . sum $ qsort randints

when compiled with GHC 8.6.4 and -O2, runs in about 9.7 seconds. The following "parallelized" version:

qsort :: Ord a => [a] -> [a]
qsort (x:xs)
  = let (a,b) = partition (<=x) xs
        a' = qsort a
        b' = qsort b
    in (b' `par` a') ++ x:b'
qsort [] = []

compiled with ghc -O2 -threaded runs in about 11.0 seconds on one capability. Add +RTS -N4, and it runs in 7.1 seconds.

Ta da! An improvement.

(In contrast, the version with System.Random runs in about 13 seconds for the non-parallel version, about 12 seconds for the parallel version on one capability (probably just because of some minor strictness improvement), and slows down substantially for each additional capability added; timings are erratic, too, though I'm not quite sure why.)

Splitting up partition

One obvious problem with this version is that, even with a' = qsort a and b' = qsort b running in parallel, they're tied to the same sequential partition call. By dividing this up into two filters:

qsort :: Ord a => [a] -> [a]
qsort (x:xs)
  = let a = qsort $ filter (<=x) xs
        b = qsort $ filter (>x)  xs
    in b `par` a ++ x:b
qsort [] = []

we speed things up to about 5.5 seconds with -N4. To be fair, even the non-parallel version is actually slightly faster with two filters in place of the partition call, at least when sorting Ints. There are probably some additional optimizations that are possible with the filters compared to the partition that make the extra comparisons worth it.

Reducing the number of sparks

Now, what you tried to do in pQuickSort above was to limit the parallel computations to the top-most set of recursions. Let's use the following psort to experiment with this:

psort :: Ord a => Int -> [a] -> [a]
psort n (x:xs)
  = let a = psort (n-1) $ filter (<=x) xs
        b = psort (n-1) $ filter (>x)  xs
    in if n > 0 then b `par` a ++ x:b else a ++ x:b
psort _ [] = []

This will parallelize the top n layers of the recursion. My particular LCG example with a seed of 1 (i.e., iterate lcg 1) recurses up to 54 layers, so psort 55 should give the same performance as the fully parallel version except for the overhead of keeping track of layers. When I run it, I get a time of about 5.8 seconds with -N4, so the overhead is quite small.

Now, look what happens as we reduce the number of layers:

| Layers |  55 |  40 |  30 |  20 |  10 |   5 |   3 |    1 |
|--------+-----+-----+-----+-----+-----+-----+-----+------|
| time   | 5.5 | 5.6 | 5.7 | 5.4 | 7.0 | 8.9 | 9.8 | 10.2 |

Note that, at the deepest layers, there's little to be gained from parallel computation. This is mostly because the average depth of the tree is probably around 25 layers or so (a balanced tree over 5,000,000 elements is about 22 layers deep), so there's only a handful of computations at 50 layers, many with weird, lop-sided partitions, and they're certainly too small to parallelize. On the flip side, there doesn't seem to be any penalty for those extra par calls.

Meanwhile, there are increasing gains all the way down to at least 20 layers, so trying to artificially limit the total number of sparks to 16 (e.g., the top 4 or 5 layers) is a big loss.

Elbertine answered 28/4, 2019 at 21:3 Comment(1)
Awarding the bounty to this answer, although the parallelism gain is hardly awesome here, because it actually keeps to something that could be called functional programming. The uncomfortable conclusion is that even today, to get really good performance out of a modern processor, you still need to put a significant amount of workload into ugly low-level code. (Luckily Haskell is good for wrapping that low-level code into libraries, but still, pity.)Rabbet
