Many parallel applications of a sequential transform in Repa

In Repa, I would like to apply a certain d-dimensional linear transform in parallel across the innermost dimension of my array, i.e., on all "column" vectors.

In general, such a transformation can be expressed as a matrix M, and each entry of M*v is just a dot product of the appropriate row of M with v. So I could just use traverse with a function that computes the appropriate dot product. This costs d^2 work.
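
For concreteness, here is a minimal sketch of that d^2-work approach (my illustration only; mulNaive is a hypothetical name, and Int elements are assumed), in which each output entry is an independent dot product:

import Data.Array.Repa as R

-- Every output entry is the dot product of one row of m with v,
-- computed from scratch: d^2 work for a d-by-d matrix.
mulNaive :: Array U DIM2 Int -> Array U DIM1 Int -> Array D DIM1 Int
mulNaive m v = fromFunction (Z :. d) rowDot
  where
    Z :. d :. _ = extent m
    rowDot (Z :. i) = sumAllS $ R.zipWith (*) (slice m (Z :. i :. All)) v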

However, my M is special: it admits a linear-work sequential algorithm. For example, M might be a lower-triangular matrix with 1s throughout the lower triangle. Then M*v is just the vector of partial sums of v (a.k.a. "scan"). These sums can be computed sequentially in the obvious way, but computing the ith entry efficiently requires the (i-1)st entry of the result. (I have several such M, each of which admits some linear-work sequential algorithm.)
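
Concretely, the partial-sums computation is just a left scan; in plain Haskell (an illustration only, not Repa code):

-- Linear work: entry i is built from entry i-1 in O(1).
partialSums :: [Int] -> [Int]
partialSums = scanl1 (+)

-- e.g. partialSums [1,2,3,4] == [1,3,6,10]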

I don't see any obvious way to use traverse (or any other Repa functions) to take advantage of this property of M. Can it be done? It would be quite wasteful to use a d^2-work algorithm (even with abundant parallelism) when such a fast linear-work algorithm is available.

(I've seen some old SO posts (e.g., here) asking similar questions, but nothing that quite matches my situation.)

UPDATE

By request, here is some illustrative code for the M that computes partial sums (as described above). As I expected, the runtime (work) grows super-linearly in d, the second component of the array's extent (ext). This comes from the fact that mulM' only specifies how to compute the ith entry of the output, independently of all the other entries. Even though there is an algorithm that is linear-time in the total size of the array, I don't know how to express it in Repa.

Interestingly, if I remove the line of main that defines the manifest array', then the runtime scales only linearly in the total size of the array! So when the arrays are delayed "all the way down," fusion/optimization must somehow be extracting the linear-work algorithm, without any explicit help from me. That's amazing, but not very useful to me, because in reality I'll need to call mulM on manifest arrays.

{-# LANGUAGE TypeOperators, ScopedTypeVariables, FlexibleContexts #-}

module Main where

import Data.Array.Repa as R

-- Multiplication by M across the innermost dimension.
-- Entry i of each "column" is the sum of its first i+1 elements,
-- recomputed from scratch: O(i) work per entry, so O(d^2) per column.
mulM arr = R.traverse arr id mulM'
    where mulM' _ (i' :. i) =
              sumAllS $ extract (Z:.0) (Z:.(i+1)) $ slice arr (i' :. All)

ext = Z :. (1000000::Int) :. (10::Int) -- super-linear runtime in 2nd arg
--ext = Z :. (10::Int) :. (1000000::Int) -- takes forever

-- A delayed array: entries are recomputed on demand, never stored.
array = fromFunction ext (\(Z:.j:.i) -> j+i)

main :: IO ()
main = do
  -- apply mulM to a manifest array
  array' :: Array U DIM2 Int <- computeP $ array
  ans :: Array U DIM2 Int <- computeP $ mulM array'
  putStrLn "done"
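
For reference, here is a rough sketch of the linear-work computation I have in mind, written directly against the manifest array's underlying unboxed vector instead of Repa's combinators (mulMLinear is a hypothetical name; Int elements and the row-major layout of Array U are assumed):

import qualified Data.Vector.Unboxed as V

-- One sequential scan per "column" (innermost-dimension slice),
-- exploiting the row-major layout of the manifest array: linear total work.
mulMLinear :: Array U DIM2 Int -> Array U DIM2 Int
mulMLinear arr = fromUnboxed sh . V.concat $
    [ V.scanl1 (+) (V.slice (j*d) d vec) | j <- [0 .. n-1] ]
  where
    sh@(Z :. n :. d) = extent arr
    vec = toUnboxed arr

This does linear total work, but the list comprehension runs the scans one after another; what I am after is exactly this per-column scan with Repa-style parallelism across the outer dimension.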
Millisent asked 11/6/2014 at 0:04 · Comments (9)
Is this question specifically about how to write scan using existing library functions? Since the constructors for all of the array types are exported, you could probably write your own function on the underlying representation (i.e., a Vector for Array U). – Ailssa
I wasn't asking about how to write scan, but would be fine with doing so if it solved my problem. However, I am skeptical that it would, because exploiting the underlying Vector representation means losing the multidimensionality, index transforms, etc. that I need to use. – Millisent
Be more specific. Publish your "correct but inefficient" code. – Seel
@Seel I've updated my post with code. – Millisent
What about precomputing an array of partial sums? – Seel
@Seel I've thought of that and will try it out. However, I think it is inefficient in space: it creates a separate manifest array (or vector) of partial sums for each column of the input. That's linear auxiliary space (though possibly it could be optimized away...). – Millisent
I am inclined to think that you can't do this with Repa. After all, it is designed to act in parallel on all the elements. Why are your arrays manifest? Can you delay them somehow and then allow fusion to work its magic? – Nagel
@DominicSteinitz I need manifest arrays at various places, for sharing (otherwise performance is terrible). Delaying a manifest array before feeding it to mulM does not help. (My previous deleted comment said that it did help, but I was running the wrong code.) Only when the arrays are "delayed all the way down" do I get linear runtime. – Millisent
Addendum: delaying a manifest array before feeding it to mulM does help a little (around a 2.5x speedup), and I think I see why. But the runtime is still essentially quadratic in the "column" size. – Millisent
