Many parallel applications of a sequential transform in Repa

In Repa, I would like to apply a certain d-dimensional linear transform in parallel across the innermost dimension of my array, i.e., on all "column" vectors.

In general, such a transformation can be expressed as a matrix M, and each entry of M*v is just a dot product of the appropriate row of M with v. So I could just use traverse with a function that computes the appropriate dot product. This costs d^2 work.
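
For concreteness, here is a minimal sketch of that d^2-work approach (my illustration only; mulNaive is a hypothetical name, and Int elements are assumed), in which each output entry is an independent dot product:

import Data.Array.Repa as R

-- Every output entry is the dot product of one row of m with v,
-- computed from scratch: d^2 work for a d-by-d matrix.
mulNaive :: Array U DIM2 Int -> Array U DIM1 Int -> Array D DIM1 Int
mulNaive m v = fromFunction (Z :. d) rowDot
  where
    Z :. d :. _ = extent m
    rowDot (Z :. i) = sumAllS $ R.zipWith (*) (slice m (Z :. i :. All)) v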

However, my M is special: it admits a linear-work sequential algorithm. For example, M might be a lower-triangular matrix with 1s throughout the lower triangle. Then M*v is just the vector of partial sums of v (a.k.a. "scan"). These sums can be computed sequentially in the obvious way, but computing the ith entry efficiently requires the (i-1)st entry of the result. (I have several such M, each of which admits some linear-work sequential algorithm.)
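
Concretely, the partial-sums computation is just a left scan; in plain Haskell (an illustration only, not Repa code):

-- Linear work: entry i is built from entry i-1 in O(1).
partialSums :: [Int] -> [Int]
partialSums = scanl1 (+)

-- e.g. partialSums [1,2,3,4] == [1,3,6,10]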

I don't see any obvious way to use traverse (or any other Repa functions) to take advantage of this property of M. Can it be done? It would be quite wasteful to use a d^2-work algorithm (even with abundant parallelism) when such a fast linear-work algorithm is available.

(I've seen some old SO posts (e.g., here) asking similar questions, but nothing that quite matches my situation.)

UPDATE

By request, here is some illustrative code for the M that computes partial sums (as described above). As I expected, the runtime (work) grows super-linearly in d, the second component of the array's extent (ext). This comes from the fact that mulM' only specifies how to compute the ith entry of the output, independently of all the other entries. Even though there is an algorithm that is linear-time in the total size of the array, I don't know how to express it in Repa.

Interestingly, if I remove the line of main that defines the manifest array', then the runtime scales only linearly in the total size of the array! So when the arrays are delayed "all the way down," fusion/optimization must somehow be extracting the linear-work algorithm, without any explicit help from me. That's amazing, but not very useful to me, because in reality I'll need to call mulM on manifest arrays.

{-# LANGUAGE TypeOperators, ScopedTypeVariables, FlexibleContexts #-}

module Main where

import Data.Array.Repa as R

-- Multiplication by M across the innermost dimension.
-- Entry i of each "column" is the sum of its first i+1 elements,
-- recomputed from scratch: O(i) work per entry, so O(d^2) per column.
mulM arr = R.traverse arr id mulM'
    where mulM' _ (i' :. i) =
              sumAllS $ extract (Z:.0) (Z:.(i+1)) $ slice arr (i' :. All)

ext = Z :. (1000000::Int) :. (10::Int) -- super-linear runtime in 2nd arg
--ext = Z :. (10::Int) :. (1000000::Int) -- takes forever

-- A delayed array: entries are recomputed on demand, never stored.
array = fromFunction ext (\(Z:.j:.i) -> j+i)

main :: IO ()
main = do
  -- apply mulM to a manifest array
  array' :: Array U DIM2 Int <- computeP $ array
  ans :: Array U DIM2 Int <- computeP $ mulM array'
  putStrLn "done"
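
For reference, here is a rough sketch of the linear-work computation I have in mind, written directly against the manifest array's underlying unboxed vector instead of Repa's combinators (mulMLinear is a hypothetical name; Int elements and the row-major layout of Array U are assumed):

import qualified Data.Vector.Unboxed as V

-- One sequential scan per "column" (innermost-dimension slice),
-- exploiting the row-major layout of the manifest array: linear total work.
mulMLinear :: Array U DIM2 Int -> Array U DIM2 Int
mulMLinear arr = fromUnboxed sh . V.concat $
    [ V.scanl1 (+) (V.slice (j*d) d vec) | j <- [0 .. n-1] ]
  where
    sh@(Z :. n :. d) = extent arr
    vec = toUnboxed arr

This does linear total work, but the list comprehension runs the scans one after another; what I am after is exactly this per-column scan with Repa-style parallelism across the outer dimension.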
Millisent asked 11/6/2014 at 0:04 · Comments (9)
Is this question specifically about how to write scan using existing library functions? Since the constructors for all of the array types are exported, you could probably write your own function on the underlying representation (i.e., a Vector for Array U). – Ailssa
I wasn't asking about how to write scan, but would be fine with doing so if it solved my problem. However, I am skeptical that it would, because exploiting the underlying Vector representation means losing the multidimensionality, index transforms, etc. that I need to use. – Millisent
Be more specific. Publish your "correct but inefficient" code. – Seel
@Seel I've updated my post with code. – Millisent
What about precomputing an array of partial sums? – Seel
@Seel I've thought of that and will try it out. However, I think it is inefficient in space: it creates a separate manifest array (or vector) of partial sums for each column of the input. That's linear auxiliary space (though possibly it could be optimized away...). – Millisent
I am inclined to think that you can't do this with Repa. After all, it is designed to act in parallel on all the elements. Why are your arrays manifest? Can you delay them somehow and then allow fusion to work its magic? – Nagel
@DominicSteinitz I need manifest arrays at various places, for sharing (otherwise performance is terrible). Delaying a manifest array before feeding it to mulM does not help. (My previous deleted comment said that it did help, but I was running the wrong code.) Only when the arrays are "delayed all the way down" do I get linear runtime. – Millisent
Addendum: delaying a manifest array before feeding it to mulM does help a little (around a 2.5x speedup), and I think I see why. But the runtime is still essentially quadratic in the "column" size. – Millisent
