I have this seemingly trivial parallel quicksort implementation; the code is as follows:
import System.Random
import Control.Parallel
import Data.List

quicksort :: Ord a => [a] -> [a]
quicksort xs = pQuicksort 16 xs -- 16 is the number of sparks used to sort

-- pQuicksort, parallel quicksort.
-- As long as n > 0, evaluates the lower and upper parts of the list in parallel;
-- when we have recursed deep enough (n == 0), this turns into a serial quicksort.
pQuicksort :: Ord a => Int -> [a] -> [a]
pQuicksort _ [] = []
pQuicksort 0 (x:xs) =
  let (lower, upper) = partition (< x) xs
  in  pQuicksort 0 lower ++ [x] ++ pQuicksort 0 upper
pQuicksort n (x:xs) =
  let (lower, upper) = partition (< x) xs
      l = pQuicksort (n `div` 2) lower
      u = [x] ++ pQuicksort (n `div` 2) upper
  in  (par u l) ++ u

main :: IO ()
main = do
  gen <- getStdGen
  let randints = take 5000000 $ randoms gen :: [Int]
  putStrLn . show . sum $ quicksort randints
I compile with
ghc --make -threaded -O2 quicksort.hs
and run with
./quicksort +RTS -N16 -RTS
No matter what I do, I cannot get this to run faster than a simple sequential implementation running on one CPU.

- Is it possible to explain why this runs so much slower on several CPUs than on one?
- Is it possible to make this scale, at least sublinearly, with the number of CPUs by some trick?
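For reference, a minimal self-contained sketch of the usual spark idiom with `par`/`pseq` (the name `sparked` is hypothetical, not from the code above). Note that `par` only evaluates its argument to weak head normal form, i.e. the first cons cell of a list, so sparking a lazily built list does not by itself force the whole sublist in parallel:

```haskell
import Control.Parallel (par, pseq)

-- Spark evaluation of l on another capability, evaluate u on the
-- current one, then concatenate. Both are only forced to WHNF here.
sparked :: [Int] -> [Int] -> [Int]
sparked l u = l `par` u `pseq` (l ++ u)

main :: IO ()
main = print (sparked [1, 2] [3, 4]) -- [1,2,3,4]
```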
EDIT: @tempestadept hinted that quicksort itself is the problem. To check this, I implemented a simple merge sort in the same spirit as the example above. It shows the same behaviour: it performs more slowly the more capabilities you add.
import System.Random
import Control.Parallel

splitList :: [a] -> ([a], [a])
splitList = helper True [] []
  where helper _     left right []     = (left, right)
        helper True  left right (x:xs) = helper False (x:left) right xs
        helper False left right (x:xs) = helper True  left (x:right) xs

merge :: (Ord a) => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys) = case x < y of
  True  -> x : merge xs (y:ys)
  False -> y : merge (x:xs) ys

mergeSort :: (Ord a) => [a] -> [a]
mergeSort xs = pMergeSort 16 xs -- we use 16 sparks

-- pMergeSort, parallel merge sort. Takes an extra argument
-- telling how many sparks to create. In our simple test it is
-- set to 16.
pMergeSort :: (Ord a) => Int -> [a] -> [a]
pMergeSort _ []  = []
pMergeSort _ [a] = [a]
pMergeSort 0 xs =
  let (left, right) = splitList xs
  in  merge (pMergeSort 0 left) (pMergeSort 0 right)
pMergeSort n xs =
  let (left, right) = splitList xs
      l = pMergeSort (n `div` 2) left
      r = pMergeSort (n `div` 2) right
  in  (r `par` l) `pseq` (merge l r)

ris :: Int -> IO [Int]
ris n = do
  gen <- getStdGen
  return . take n $ randoms gen

main = do
  r <- ris 100000
  putStrLn . show . sum $ mergeSort r
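For anyone reproducing this, here is a quick self-contained sanity check of the `splitList` and `merge` helpers (definitions copied from the listing so this compiles on its own; note `splitList` reverses the order within each half, which is harmless for sorting):

```haskell
splitList :: [a] -> ([a], [a])
splitList = helper True [] []
  where helper _     left right []     = (left, right)
        helper True  left right (x:xs) = helper False (x:left) right xs
        helper False left right (x:xs) = helper True  left (x:right) xs

merge :: (Ord a) => [a] -> [a] -> [a]
merge xs [] = xs
merge [] ys = ys
merge (x:xs) (y:ys)
  | x < y     = x : merge xs (y:ys)
  | otherwise = y : merge (x:xs) ys

main :: IO ()
main = do
  print (splitList [1, 2, 3, 4, 5 :: Int]) -- ([5,3,1],[4,2])
  print (merge [1, 3, 5] [2, 4 :: Int])    -- [1,2,3,4,5]
```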
Comments:

- ... `pseq`, even when purging down any possible thunks with `sum`. Perhaps there's an entirely different problem involved.
- As I have now deleted my answer, here it is again as a comment: 1. naming that function just `quicksort` might confuse, since you wouldn't expect such a function to accept an extra parallelism argument; 2. use type signatures, always for top-level functions and even more so when they might work slightly differently than what the name suggests; 3. use library functions such as `partition` if possible. Good question, BTW. – Rabbetl
- (1) ... you probably want something like l `par` u `pseq` (u ++ l). (2) While you run sub-computations in parallel, they're not really evaluated until needed, so you should force each sub-list to NF (or at least its full structure), something like forceList l `par` forceList u `pseq` (u ++ l), where forceList is (your own) function that forces evaluation of a list. Also, for proper benchmarking I suggest using criterion. – Henleyonthames
- Compile with the -rtsopts flag, and add the -sstderr flag when you then run the program. – Amylolysis
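A minimal sketch of the kind of forcing helper the comment above describes (`forceList` and `combine` are hypothetical names, not from the question; Control.DeepSeq's `force` would be the library alternative):

```haskell
import Control.Parallel (par, pseq)

-- Walk the whole spine of the list and force each element to WHNF,
-- so a spark evaluating it does real work rather than stopping at
-- the first cons cell.
forceList :: [a] -> ()
forceList = foldr (\x acc -> x `pseq` acc) ()

-- Suggested combine step: fully force both halves, sparking one,
-- before concatenating.
combine :: Ord a => [a] -> [a] -> [a]
combine l u = forceList l `par` forceList u `pseq` (l ++ u)

main :: IO ()
main = print (combine [1, 3] [2, 4 :: Int]) -- [1,3,2,4]
```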