Haskell Repa stencil hacks
Asked Answered
K

1

8

The Problem

I'm trying to understand how Repa works and I was plaing with a "blur" example code from Repa Examples package. The code uses stencil2 Quasi Quote:

[stencil2|   2  4  5  4  2
             4  9 12  9  4
             5 12 15 12  5
             4  9 12  9  4
             2  4  5  4  2 |]

Which is simply TemplateHaskell snippet, which generates a function:

makeStencil2 5 5 coeffs where
     {-# INLINE[~0] coeffs #-}
     coeffs = \ ix -> case ix of
                      Z :. -2 :. -2 -> Just 2
                      Z :. -2 :. -1 -> Just 4
                      Z :. -2 :. 0 -> Just 5
                      Z :. -2 :. 1 -> Just 4
                      Z :. -2 :. 2 -> Just 2
                      [...]
                      _ -> Nothing

Its ok to use TH, but I would love to keep the coefs in an Repa Array, so I've changed the code to use an Repa Array instead, but my code works 2 times slower comparing to the original one.

Some fancy notes

I've noticed, that Repa authors use hardcoded 7 by 7 matrix of values to get coefficients: http://hackage.haskell.org/package/repa-3.2.3.3/docs/src/Data-Array-Repa-Stencil-Dim2.html#forStencil2 (see: template7x7)

Questions

  1. I want to ask you why it is not optimised as original one and how can we fix it? I want to write a "convolve" function, which will allow me to run convolution of an stencil (a Repa Array) over an image.
  2. Do we really have to use such hardcoded matrixes to make GHC optimize the code? There is really no way to create fast Haskell code without using such "hacks"?

The Code

Original blur function:

blur    :: Monad m => Int -> Array U DIM2 Double -> m (Array U DIM2 Double)
blur !iterations arrInit
 = go iterations arrInit
 where  go !0 !arr = return arr
        go !n !arr  
         = do   arr'    <- computeP
                         $ A.smap (/ 159)
                         $ forStencil2 BoundClamp arr
                           [stencil2|   2  4  5  4  2
                                        4  9 12  9  4
                                        5 12 15 12  5
                                        4  9 12  9  4
                                        2  4  5  4  2 |]
                go (n-1) arr'

my blur function:

blur !iterations arrInit = go iterations arrInit
    where 
          stencilx7 = fromListUnboxed (Z :. 7 :. 7) 
                    [  0,  0,  0,  0,  0,  0, 0
                    ,  0,  2,  4,  5,  4,  2, 0
                    ,  0,  4,  9, 12,  9,  4, 0
                    ,  0,  5, 12, 15, 12,  5, 0
                    ,  0,  4,  9, 12,  9,  4, 0
                    ,  0,  2,  4,  5,  4,  2, 0
                    ,  0,  0,  0,  0,  0,  0, 0
                    ] :: Array U DIM2 Int
          magicf (Z :. x :. y) = Just $ fromIntegral $ unsafeIndex stencilx7 (Z:. (x+3) :. (y+3))
          go !0 !arr = return arr
          go !n !arr  
           = do   
                  arr'    <- computeP
                           $ A.smap (/ 159)
                           $ A.forStencil2 BoundClamp arr 
                            $ makeStencil2 5 5 magicf
                  go (n-1) arr'

Rest of the code:

{-# LANGUAGE PackageImports, BangPatterns, TemplateHaskell, QuasiQuotes #-}
{-# OPTIONS -Wall -fno-warn-missing-signatures -fno-warn-incomplete-patterns #-}

import Data.List
import Control.Monad
import System.Environment
import Data.Word
import Data.Array.Repa.IO.BMP
import Data.Array.Repa.IO.Timing
import Data.Array.Repa                          as A
import qualified Data.Array.Repa.Repr.Unboxed   as U
import Data.Array.Repa.Stencil                  as A
import Data.Array.Repa.Stencil.Dim2             as A
import Prelude                                  as P

main 
 = do   args    <- getArgs
        case args of
         [iterations, fileIn, fileOut]  -> run (read iterations) fileIn fileOut
         _                              -> usage

usage   = putStr $ unlines
        [ "repa-blur <iterations::Int> <fileIn.bmp> <fileOut.bmp>" ]


-- | Perform the blur.
run :: Int -> FilePath -> FilePath -> IO ()
run iterations fileIn fileOut
 = do   arrRGB  <- liftM (either (error . show) id) 
                $  readImageFromBMP fileIn

        arrRGB `deepSeqArray` return ()
        let (arrRed, arrGreen, arrBlue) = U.unzip3 arrRGB
        let comps                       = [arrRed, arrGreen, arrBlue]

        (comps', tElapsed)
         <- time $ P.mapM (process iterations) comps

        putStr $ prettyTime tElapsed

        let [arrRed', arrGreen', arrBlue'] = comps'
        writeImageToBMP fileOut
                (U.zip3 arrRed' arrGreen' arrBlue')


process :: Monad m => Int -> Array U DIM2 Word8 -> m (Array U DIM2 Word8)
process iterations 
        = promote >=> blur iterations >=> demote
{-# NOINLINE process #-}


promote :: Monad m => Array U DIM2 Word8 -> m (Array U DIM2 Double)
promote arr
 = computeP $ A.map ffs arr
 where  {-# INLINE ffs #-}
        ffs     :: Word8 -> Double
        ffs x   =  fromIntegral (fromIntegral x :: Int)
{-# NOINLINE promote #-}


demote  :: Monad m => Array U DIM2 Double -> m (Array U DIM2 Word8)
demote arr
 = computeP $ A.map ffs arr

 where  {-# INLINE ffs #-}
        ffs     :: Double -> Word8
        ffs x   =  fromIntegral (truncate x :: Int)

Compile with: ghc -O2 -threaded -fllvm -fforce-recomp Main.hs -ddump-splices

Koumis answered 3/11, 2013 at 2:47 Comment(0)
I
1
  1. Reading convolution coeffs from the array theoretically couldn't be as fast as soldering constants right in the compiled code, because the latter approach costs nothing on machine level.

  2. No, GHC is able to boil down arbitrarily sized static stencils. See my implementation of static convolutions with fixed-vectors of lambdas:

    [dim2St| 1   2   1
             0   0   0
            -1  -2  -1 |]
    -->
    Dim2Stencil
     n3
     n3
     (VecList
        [VecList
           [\ acc a -> return (acc + a),
            \ acc a -> (return $ (acc + (2 * a))),
            \ acc a -> return (acc + a)],
         VecList
           [\ acc _ -> return acc,
            \ acc _ -> return acc,
            \ acc _ -> return acc],
         VecList
           [\ acc a -> return (acc - a),
            \ acc a -> (return $ (acc + (-2 * a))),
            \ acc a -> return (acc - a)]])
     (\ acc a reduce -> reduce acc a)
     (return 0)
    
Immobile answered 3/11, 2013 at 10:18 Comment(7)
1. But If such array is known at compile time (it is known in this example) and additional we inline each funciton, which uses it, GHC could be able to optimize it the same way, couldnt it?Koumis
2. I know it, because the stencil produces lambda function at compile time also (see the ddump-spliced code), but can we use simple "static" array known at compile time instead? This would give me the flexibility to produce faster code if such array is defined statically and slower if it is fed dynamically.Koumis
Additiona, would you be so nice and tell me please, what is really happening under the hood? Theoretically the program has to go over each pixel of image and multiply it with coefficients from stencil. So this is simply array ber array convolution. How could such static code be optimized, that it is faster than such simple multiplication? (I simply want to understand the optimizations, whcih are happening here).Koumis
1. If you don't need benefits of dynamicity (e. g. pluggable stencil at runtime, someway), why couldn't you just use quasi quoter?Immobile
2. In Haskell, flexibility could be introduced easily at any point, isn't it? Write, for example, type class ConvolverImmobile
If you convolves array with array, at each pixel you need 1) read pixel value from the image and store it in the register 2) read coeff and store it in the register 3) multiply them. With static stencil, 1) read value from the image and store it in the register; 2) multiply the register inplace by the immediate constant. Computing convolution as 5x5 is register intensive and I suspect the first version is additionaly suffers from register lack, and need to save computed values on the stack, or can't utilize all 4 arithmetic units (that is just an assumption).Immobile
let us continue this discussion in chatImmobile

© 2022 - 2024 — McMap. All rights reserved.