How to take an array slice with Repa over a range

I am attempting to implement a cumulative sum function using Repa in order to calculate integral images. My current implementation looks like the following:

cumsum :: (Elt a, Num a) => Array DIM2 a -> Array DIM2 a
cumsum array = traverse array id cumsum' 
    where
        elementSlice inner outer = slice array (inner :. (0 :: Int))
        cumsum' f (inner :. outer) = Repa.sumAll $ elementSlice inner outer

The problem is in the elementSlice function. In MATLAB or NumPy this could be specified as array[inner, 0:outer]. So what I am looking for is something along the lines of:

slice array (inner :. (Range 0 outer))

However, it seems that slices cannot currently be specified over ranges in Repa. I considered using partitions, as discussed in "Efficient Parallel Stencil Convolution in Haskell", but this seems like a rather heavyweight approach if they would change on every iteration. I also considered masking the slice (multiplying by a binary vector), but this again seemed like it would perform very poorly on large matrices, since I would be allocating a mask vector for every point in the matrix.

My question - does anyone know if there are plans to add support for slicing over a range to Repa? Or is there a performant way I can already attack this problem, maybe with a different approach?

Rew answered 29/5, 2011 at 19:44 Comment(0)

Actually, I think the main problem is that Repa does not have a scan primitive. (However, Accelerate, a very similar library, does.) There are two variants of scan, prefix scan and suffix scan. Given a 1D array

[a_0, ..., a_n]

prefix scan returns

[0, a_0, a_0 + a_1, ..., a_0 + ... + a_{n-1} ]

while a suffix scan produces

[a_0, a_0 + a_1, ..., a_0 + a_1 + ... + a_n ]

I assume this is what you are going for with your cumulative sum (cumsum) function.
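On plain Haskell lists (not Repa arrays), the two variants can be sketched with the Prelude's scanl and scanl1; prefixScan and suffixScan are illustrative names, not part of any library:

```haskell
-- Prefix (exclusive) scan: starts with the identity and drops the
-- final total, so the output has the same length as the input.
prefixScan :: Num a => [a] -> [a]
prefixScan xs = init (scanl (+) 0 xs)

-- Suffix (inclusive) scan: running totals including each element.
suffixScan :: Num a => [a] -> [a]
suffixScan = scanl1 (+)
```

For example, prefixScan [1,2,3,4] is [0,1,3,6] while suffixScan [1,2,3,4] is [1,3,6,10].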

Prefix and suffix scans generalise quite naturally to multi-dimensional arrays and have an efficient implementation based on tree reduction. A relatively old paper on the topic is "Scan Primitives for Vector Computers". Also, Conal Elliott has recently written several blog posts on deriving efficient parallel scans in Haskell.

The integral image (on a 2D array) can be calculated by doing two scans, one horizontally and one vertically. In the absence of a scan primitive I have implemented one, very inefficiently.

-- Inclusive running sums along each row, computed by repeatedly
-- adding a shifted copy of each column (hence O(n) array passes).
horizScan :: Array DIM2 Int -> Array DIM2 Int
horizScan arr = foldl addIt arr [0 .. n - 1]
  where
    addIt :: Array DIM2 Int -> Int -> Array DIM2 Int
    addIt accum i = accum +^ vs
       where
         -- add column i of the source to every column to its right
         vs = toAdd (i+1) n (slice arr (Z:.All:.i))
    (Z:.m:.n) = arrayExtent arr

--
-- Given an @i@ and a length @len@ and a 1D array @arr@ 
-- (of length n) produces a 2D array of dimensions n X len.
-- The columns are filled with zeroes up to row @i@.
-- Subsequently they are filled with columns equal to the 
-- values in @arr@.
--
toAdd :: Int -> Int -> Array DIM1 Int -> Array DIM2 Int
toAdd i len arr = traverse arr (\sh -> sh:.(len::Int)) 
               (\_ (Z:.n:.m) -> if m >= i then arr ! (Z:.n) else 0) 

The function for calculating the integral image can then be defined as

vertScan :: Array DIM2 Int -> Array DIM2 Int
vertScan = transpose . horizScan . transpose

integralImage :: Array DIM2 Int -> Array DIM2 Int
integralImage = horizScan . vertScan
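For intuition, the two-pass construction can be checked on a tiny matrix using plain lists instead of Repa arrays (horiz and integral are sketch names, not the functions above):

```haskell
import Data.List (transpose)

-- Horizontal pass: inclusive running sums along each row.
horiz :: [[Int]] -> [[Int]]
horiz = map (scanl1 (+))

-- Integral image as two passes: rows first, then columns
-- (a column pass is just a row pass on the transpose).
integral :: [[Int]] -> [[Int]]
integral = transpose . horiz . transpose . horiz
```

Each entry (i, j) of the result is the sum of the submatrix with corners (0, 0) and (i, j); for instance, integral [[1,2],[3,4]] gives [[1,3],[4,10]].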

Given that scan has been implemented for Accelerate, it shouldn't be too hard to add it to Repa. I am not sure if an efficient implementation using the existing Repa primitives is possible or not.

Atalee answered 30/5, 2011 at 2:2 Comment(4)
I'd be very surprised if someone came up with an efficient implementation of scans for Repa. The model Repa uses for arrays is such that all elements can be computed independently. But for a scan, each element depends on the previous element, at least if you want to be able to compute it efficiently. Hence, I believe you would have to come up with something quite different from Repa to support efficient scans.Conspecific
Thanks, I really appreciate the answer! scans definitely fit what I need, and I had actually tried to implement one similar to how the Repa library itself implements fold, but got stuck and moved on to the code presented in the question. Either way, I read through that paper and Conal's posts - thanks much for those links. I haven't completely given up yet, going to continue reading Repa source and see if I can come up with anything. At least I can go with the inefficient version for now - and leave it for optimization later.Rew
@svenningsson. Actually no, that's the interesting thing about scan. If your operator is binary associative then there is an efficient way to make use of multiple processors (albeit with some redundancy). Imagine the simpler task of summing the numbers 1 to 8. With one processor you accumulate a result as you go along, successively holding the value 1, 3, 6, 10, etc. With four processors. Pass 1: 1st sums 1&2, 2nd 3&4, 3rd 5&6, 4th 7&8. Pass 2. 1st adds 3&7, 2nd 11&15. (3rd & 4th not used). Pass 3: Add 10 and 26. This is a logarithmic number of steps. Scan is more complex, but similar.Atalee
I'm well aware that it's possible to efficiently parallelize scan. My point is that I don't think the model Repa uses admits such an implementation.Conspecific

Extracting a subrange is an index space manipulation that is easy enough to express with fromFunction, though we should probably add a nicer wrapper for it to the API.

let arr = fromList (Z :. (5 :: Int)) [1, 2, 3, 4, 5 :: Int] 
in  fromFunction (Z :. 3) (\(Z :. ix) -> arr ! (Z :. ix + 1))

> [2,3,4]

Elements in the result are retrieved by offsetting the provided index and looking that up from the source. This technique extends naturally to arrays of higher rank.
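To see why this is pure index arithmetic, here is the same subrange trick on a toy delayed-array type in plain Haskell; Delayed, fromListD, subRange, and toListD are illustrative names modelling what fromFunction builds, not Repa API:

```haskell
-- A toy "delayed" 1-D array: an extent plus an indexing function.
data Delayed a = Delayed Int (Int -> a)

fromListD :: [a] -> Delayed a
fromListD xs = Delayed (length xs) (xs !!)

-- Subrange [lo, lo+len): offset every index into the source,
-- the same trick as the fromFunction example above.
subRange :: Int -> Int -> Delayed a -> Delayed a
subRange lo len (Delayed _ f) = Delayed len (\i -> f (i + lo))

toListD :: Delayed a -> [a]
toListD (Delayed n f) = map f [0 .. n - 1]
```

With this model, toListD (subRange 1 3 (fromListD [1,2,3,4,5])) yields [2,3,4], matching the result above.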

As for parallel folds and scans, we would implement these by adding a primitive to the library. We can't define parallel reductions in terms of map, but we can still use the overall approach of delayed arrays, so it would be a reasonably orthogonal extension.

Pavior answered 31/5, 2011 at 6:42 Comment(1)
Ben, just wanted to thank you for the answer. I have a solution now that looks like gist.github.com/1009061 . All that remains is to perform bounds checking and generalize it to multiple dimensions. I am thinking something along the lines of cumsum (Z :. All :. All) arr similar to say extend to make it really convenient to sum across multiple dimensions (instead of transpose solution I am using currently).Rew
