In Repa, I would like to apply a certain d-dimensional linear transform in parallel across the innermost dimension of my array, i.e., on all "column" vectors.
In general, such a transformation can be expressed as a matrix M, and each entry of M*v is just a dot product of the appropriate row of M with v. So I could just use traverse with a function that computes the appropriate dot product. This costs d^2 work.
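For concreteness, here is the generic d^2-work approach sketched in plain Haskell on lists (not Repa; the names matVec and lowerOnes are hypothetical helpers for illustration):

```haskell
-- Generic dense matrix-vector product: each of the d output entries is a
-- dot product against a length-d row, so the total work is d^2.
matVec :: Num a => [[a]] -> [a] -> [a]
matVec m v = [sum (zipWith (*) row v) | row <- m]

-- The special lower-triangular all-ones M discussed below: row i has 1s
-- in columns 0..i and 0s elsewhere.
lowerOnes :: Int -> [[Int]]
lowerOnes d = [ [ if j <= i then 1 else 0 | j <- [0 .. d-1] ] | i <- [0 .. d-1] ]
```

For example, matVec (lowerOnes 4) [1,2,3,4] yields the partial sums [1,3,6,10], but does d^2 multiplications to get them.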
However, my M is special: it admits a linear-work sequential algorithm. For example, M might be a lower-triangular matrix with 1s throughout the lower triangle. Then M*v is just the vector of partial sums of v (a.k.a. "scan"). These sums can be computed sequentially in an obvious way, but one needs the (i-1)st entry of the result to compute the ith entry efficiently. (I have several such M, all of which can be computed in one way or another in linear sequential time.)
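In plain Haskell (outside Repa), this linear-work computation is just a left scan, shown here as a sketch of the per-vector work being described:

```haskell
-- Partial sums of v, i.e., M*v for the all-ones lower-triangular M,
-- in linear work: each entry is built from the previous one, so there
-- are d-1 additions total instead of ~d^2/2.
partialSums :: Num a => [a] -> [a]
partialSums = scanl1 (+)
```

For example, partialSums [1,2,3,4] is [1,3,6,10], using three additions.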
I don't see any obvious way to use traverse (or any other Repa functions) to take advantage of this property of M. Can it be done? It would be quite wasteful to use a d^2-work algorithm (even with abundant parallelism) when such a fast linear-work algorithm is available.
(I've seen some old SO posts (e.g., here) asking similar questions, but nothing that quite matches my situation.)
UPDATE
By request, here is some illustrative code for the M that computes partial sums (as described above). As I expected, the runtime (work) grows super-linearly in d, the second argument of the extent of the array (ext). This comes from the fact that mulM' only specifies how to compute the ith entry of the output, independently of all other entries. Even though there is a linear-time algorithm in the total size of the array, I don't know how to express it in Repa.
Interestingly, if I remove the line that defines the manifest array' from main, then the runtime scales only linearly in the total size of the array! So when the arrays are delayed "all the way down," fusion/optimization must somehow be extracting the linear-work algorithm, but without any explicit help from me. That's amazing, but also not very useful to me, because in reality I'll need to call mulM on manifest arrays.
{-# LANGUAGE TypeOperators, ScopedTypeVariables, FlexibleContexts #-}
module Main where

import Data.Array.Repa as R

-- multiplication by M across the innermost dimension
mulM arr = traverse arr id mulM'
  where mulM' _ (i' :. i) =
          sumAllS $ extract (Z :. 0) (Z :. (i+1)) $ slice arr (i' :. All)

ext = Z :. (1000000::Int) :. (10::Int)  -- super-linear runtime in 2nd arg
--ext = Z :. (10::Int) :. (1000000::Int)  -- takes forever

array = fromFunction ext (\(Z :. j :. i) -> j + i)

main :: IO ()
main = do
  -- apply mulM to a manifest array
  array' :: Array U DIM2 Int <- computeP array
  ans :: Array U DIM2 Int <- computeP $ mulM array'
  print "done"
Comments:

- "… scan using existing library functions? Since the constructors for all of the array types are exported, you could probably write your own function on the underlying representation (i.e., a Vector for Array U)." – Ailssa

- "… scan, but would be fine with doing so if it solved my problem. However, I am skeptical that it would do so, because exploiting the underlying Vector representation means losing the multidimensionality, index transforms, etc. that I need to use." – Millisent

- "… mulM does not help. (My previous deleted comment said that it did help, but I was running the wrong code.) Only when the arrays are 'delayed all the way down' do I get linear runtime." – Millisent