Write a parallel array Haskell expression once, run on CPUs & GPUs with repa and accelerate

Repa and Accelerate API Similarity

The Haskell repa library provides automatically parallel array computation on CPUs; the accelerate library provides automatic data parallelism on GPUs. The APIs are quite similar, with identical representations of N-dimensional arrays. One can even switch between accelerate and repa arrays with fromRepa and toRepa in Data.Array.Accelerate.IO:

fromRepa :: (Shapes sh sh', Elt e) => Array A sh e -> Array sh' e
toRepa   :: Shapes sh sh'          => Array sh' e  -> Array A sh e
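As an untested sketch (assuming the signatures above, and the `A` representation tag that accelerate-io exports), one could hand an accelerate array to repa and finish the computation there:

```haskell
import qualified Data.Array.Repa          as R
import qualified Data.Array.Accelerate    as A
import qualified Data.Array.Accelerate.IO as A

-- Untested sketch: convert an accelerate array to repa's 'A'
-- representation, then continue the computation with repa.
accThenRepa :: A.Array A.DIM2 Float -> IO (R.Array R.U R.DIM2 Float)
accThenRepa arr = R.computeP (R.map (+ 1) (A.toRepa arr))
```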

There are multiple backends for accelerate, including LLVM, CUDA and FPGA (see Figure 2 of http://www.cse.unsw.edu.au/~keller/Papers/acc-cuda.pdf). I've spotted a repa backend for accelerate, though that library doesn't appear to be maintained. Given that the repa and accelerate programming models are similar, I'm hopeful that there is an elegant way of switching between them, i.e. functions written once that can be executed either with repa's R.computeP or with one of accelerate's backends, e.g. the CUDA run function.

Two very similar functions: Repa and Accelerate on a Pumpkin

Take a simple image processing thresholding function. If a grayscale pixel value is less than 50, then it is set to 0, otherwise it retains its value. Here's what it does to a pumpkin:

The following code presents repa and accelerate implementations:

module Main where

import qualified Data.Array.Repa as R
import qualified Data.Array.Repa.IO.BMP as R
import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.IO as A
import qualified Data.Array.Accelerate.Interpreter as A

import Data.Word

-- Apply threshold over image using accelerate (interpreter)
thresholdAccelerate :: IO ()
thresholdAccelerate = do
  img <- either (error . show) id `fmap` A.readImageFromBMP "pumpkin-in.bmp"
  let newImg = A.run $ A.map evalPixel (A.use img)
  A.writeImageToBMP "pumpkin-out.bmp" newImg
    where
      -- Plain 'if' with an Ord comparison fails at runtime with
      -- "*** Exception: Prelude.Ord.compare applied to EDSL types";
      -- use accelerate's (>*) and (?) combinators instead.
      evalPixel :: A.Exp A.Word32 -> A.Exp A.Word32
      evalPixel p = (p A.>* 50) A.? (p, 0)

-- Apply threshold over image using repa
thresholdRepa :: IO ()
thresholdRepa = do
  let arr :: IO (R.Array R.U R.DIM2 (Word8,Word8,Word8))
      arr = either (error . show) id `fmap` R.readImageFromBMP "pumpkin-in.bmp" 
  img <- arr
  newImg <- R.computeP (R.map applyAtPoint img)
  R.writeImageToBMP "pumpkin-out.bmp" newImg
  where
    applyAtPoint :: (Word8,Word8,Word8) -> (Word8,Word8,Word8)
    applyAtPoint (r,g,b) =
        let [r',g',b'] = map applyThresholdOnPixel [r,g,b]
        in (r',g',b')
    applyThresholdOnPixel x = if x > 50 then x else 0

data BackendChoice = Repa | Accelerate

main :: IO ()
main = do
  let userChoice = Repa -- pretend this is a command-line flag
  case userChoice of
    Repa       -> thresholdRepa
    Accelerate -> thresholdAccelerate

Question: can I write this only once?

The implementations of thresholdAccelerate and thresholdRepa are very similar. Is there an elegant way to write array processing functions once, then choose between multicore CPUs (repa) and GPUs (accelerate) programmatically? One option is to choose my import according to whether I want CPU or GPU, i.e. to import either Data.Array.Accelerate.CUDA or Data.Array.Repa and execute an action of type Acc a with:

run :: Arrays a => Acc a -> a
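With that approach the backend is fixed at compile time by a single import. A minimal sketch (using the interpreter here, but any backend module exporting run would do):

```haskell
import qualified Data.Array.Accelerate             as A
-- Swap this one import for Data.Array.Accelerate.CUDA to target the GPU.
import qualified Data.Array.Accelerate.Interpreter as Backend

-- Threshold expressed once, over the embedded Acc language.
threshold :: A.Acc (A.Array A.DIM2 A.Word32) -> A.Acc (A.Array A.DIM2 A.Word32)
threshold = A.map (\p -> (p A.>* 50) A.? (p, 0))

runThreshold :: A.Array A.DIM2 A.Word32 -> A.Array A.DIM2 A.Word32
runThreshold = Backend.run . threshold . A.use
```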

Or, to use a type class e.g. something roughly like:

main :: IO ()
main = do
  let userChoice = Repa -- pretend this is a command line flag
  action <- case userChoice of
    Repa       -> applyThreshold :: RepaBackend ()
    Accelerate -> applyThreshold :: CudaBackend ()
  action
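Restricted to accelerate alone, something like this is easy with value-level dispatch; the hard part is fitting repa behind the same interface. A hypothetical sketch over accelerate backends only (the Backend type and runWith are invented for illustration):

```haskell
import qualified Data.Array.Accelerate             as A
import qualified Data.Array.Accelerate.Interpreter as I
-- import qualified Data.Array.Accelerate.CUDA     as C  -- when built with CUDA

-- Hypothetical backend selector; extend with further accelerate backends.
data Backend = Interpreter | Cuda

runWith :: A.Arrays a => Backend -> A.Acc a -> a
runWith Interpreter = I.run
runWith Cuda        = error "build against accelerate-cuda and dispatch to C.run"
```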

Or is it the case that, for each parallel array function I wish to express for both CPUs and GPUs, I must implement it twice --- once with the repa library and again with the accelerate library?

Landri answered 21/4, 2014 at 17:13 Comment(3)
Accelerate has a repa backend: github.com/blambo/accelerate-repa - never used it but I would be glad to know whether that's an interesting enough fallback.Armure
Right, it seems as though the intention would be to use a repa backend and write your code with accelerate, but I guess there's not much interest in that for whatever reason.Lemuellemuela
Yeah, but I thought it could be appropriate exactly because accelerate allows us to write the function once and have it run on parallel CPU or on GPU depending on which you want. It's a module switch instead of flipping a constructor, but you can write the latter on top of the former.Armure

The short answer is that, at the moment, you unfortunately need to write both versions.

However, we are working on CPU support for Accelerate, which will obviate the need for the Repa version of the code. In particular, Accelerate very recently gained a new LLVM-based backend that targets both GPUs and CPUs: https://github.com/AccelerateHS/accelerate-llvm

This new backend is still incomplete, buggy, and experimental, but we are planning to make it into a viable alternative to the current CUDA backend.
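(As of more recent accelerate releases this landed as accelerate-llvm-native; an editor's sketch, assuming its Data.Array.Accelerate.LLVM.Native module, shows the CPU path is just another run:)

```haskell
import qualified Data.Array.Accelerate             as A
import qualified Data.Array.Accelerate.LLVM.Native as CPU  -- multicore CPU backend

-- The same Acc program, executed on CPUs via LLVM with no other changes.
doubleAll :: A.Vector Float -> A.Vector Float
doubleAll = CPU.run . A.map (* 2) . A.use
```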

Hairy answered 22/4, 2014 at 20:46 Comment(3)
Great! What is the longer term story w.r.t. the LLVM backend of accelerate and repa? Will one deprecate the other? Is the LLVM IR expressive enough to implement the type indexed array representations e.g. delayed arrays? Or is the plan to keep repa and accelerate-llvm separate projects? Once the LLVM backend is mature, I guess an obvious benchmark for it is repa?Landri
Accelerate uses delayed arrays as well, but it automatically decides what array representation to use. That is easier in Accelerate than in Repa as Accelerate is a more restrictive language, so I’m hopeful that we can do without the type-level hints for representations as are needed in Repa. We have no plans to continue developing the accelerate-repa backend. The new LLVM backend is already faster than plain Repa code on an admittedly very limited set of benchmarks.Hairy
This should probably be updated since CPU support is there now :)Jinajingle

I thought about this a year and a few months ago while designing yarr. At the time there were serious issues with type family inference (or something along those lines; I don't remember exactly) that prevented implementing such a unifying wrapper over vector, repa, yarr, accelerate, etc. both efficiently and without requiring too many explicit type signatures, or perhaps prevented implementing it at all.

That was GHC 7.6. I don't know whether GHC 7.8 brings meaningful improvements in this area. In theory I didn't see any fundamental obstacles, so we can expect such a wrapper someday, sooner or later, once GHC is ready.

Avra answered 21/4, 2014 at 20:13 Comment(0)