The most straightforward monadic 'stream' is just a list of monadic actions, Monad m => [m a]. The sequence :: [m a] -> m [a] function evaluates each monadic action and collects the results. As it turns out, sequence is not very efficient, though, because it operates on lists, and the monad is an obstacle to achieving fusion in anything but the simplest cases.
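As a minimal (made-up) illustration of that shape:
actions :: [IO Int]
actions = [return 1, return 2, return 3]
collected :: IO [Int]    -- runs each action in order and collects the results
collected = sequence actions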
The question is: What is the most efficient approach for monadic streams?
To investigate this, I provide a toy problem along with a few attempts to improve performance. The source code can be found on GitHub. The single benchmark presented below may be misleading for more realistic problems, although I think it is a worst-case scenario of sorts, i.e. the maximum possible overhead per useful computation.
The toy problem
The toy problem is a maximum-length 16-bit Linear Feedback Shift Register (LFSR), implemented in C in a somewhat over-elaborate way, with a Haskell wrapper in the IO monad. 'Over-elaborate' refers to the unnecessary use of a struct and its malloc: the purpose of this complication is to make it more similar to realistic situations where all you have is a Haskell wrapper around an FFI to a C struct with OO-ish new, set, get, operate semantics (i.e. very much the imperative style). A typical Haskell program looks like this:
import LFSR
main = do
    lfsr <- newLFSR          -- make a LFSR object
    setLFSR lfsr 42          -- initialise it with 42
    stepLFSR lfsr            -- do one update
    getLFSR lfsr >>= print   -- extract the new value and print
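The wrapper module itself is not reproduced in the post, so here is a minimal sketch of what LFSR.hs might look like. The C symbol names new_lfsr, free_lfsr, step_lfsr and get_lfsr, the finalizer, and the ForeignPtr plumbing are all assumptions (only set_lfsr and the use of withForeignPtr are mentioned later on); the real module on GitHub may differ.
{-# LANGUAGE ForeignFunctionInterface #-}
module LFSR where
import Data.Word (Word32)
import Foreign.Ptr (Ptr, FunPtr)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, withForeignPtr)
data LFSRStruct                      -- opaque tag for the C struct
type LFSR = ForeignPtr LFSRStruct
foreign import ccall unsafe "lfsr.h new_lfsr"
    c_newLFSR :: IO (Ptr LFSRStruct)
foreign import ccall "lfsr.h &free_lfsr"
    c_freeLFSR :: FunPtr (Ptr LFSRStruct -> IO ())
foreign import ccall unsafe "lfsr.h set_lfsr"
    c_setLFSR :: Ptr LFSRStruct -> Word32 -> IO ()
foreign import ccall unsafe "lfsr.h step_lfsr"
    c_stepLFSR :: Ptr LFSRStruct -> IO ()
foreign import ccall unsafe "lfsr.h get_lfsr"
    c_getLFSR :: Ptr LFSRStruct -> IO Word32
newLFSR :: IO LFSR                   -- allocate on the C side, attach the finalizer
newLFSR = newForeignPtr c_freeLFSR =<< c_newLFSR
setLFSR :: LFSR -> Word32 -> IO ()
setLFSR l v = withForeignPtr l (\p -> c_setLFSR p v)
stepLFSR :: LFSR -> IO ()
stepLFSR l = withForeignPtr l c_stepLFSR
getLFSR :: LFSR -> IO Word32
getLFSR l = withForeignPtr l c_getLFSR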
The default task is to calculate the average of the values (doubles) of 10'000'000 iterations of the LFSR. This task could be part of a suite of tests to analyse the 'randomness' of this stream of 16-bit integers.
0. Baseline
The baseline is the C implementation of the average over n iterations:
double avg(state_t* s, int n) {
    double sum = 0;
    for(int i=0; i<n; i++, sum += step_get_lfsr(s));
    return sum / (double)n;
}
The C implementation isn't meant to be particularly good, or fast. It just provides a meaningful computation.
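On the Haskell side, the baseline figure presumably comes from calling this C function through a binding along the following lines (the header location, the name avg_c and the exact foreign types are assumptions, not taken from the post):
import Foreign.Ptr (Ptr)
import Foreign.C.Types (CInt, CDouble)
-- hypothetical binding for the C baseline; LFSRStruct is the opaque struct tag
foreign import ccall unsafe "lfsr.h avg"
    avg_c :: Ptr LFSRStruct -> CInt -> IO CDouble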
1. Haskell lists
Compared to the C baseline, on this task Haskell lists are 73x slower.
=== RunAvg =========
Baseline: 1.874e-2
IO: 1.382488
factor: 73.77203842049093
This is the implementation (RunAvg.hs):
import LFSR
import Data.Word (Word32)
import Control.Monad (replicateM)
step1 :: LFSR -> IO Word32
step1 lfsr = stepLFSR lfsr >> getLFSR lfsr
avg :: LFSR -> Int -> IO Double
avg lfsr n = mean <$> replicateM n (step1 lfsr) where
    mean :: [Word32] -> Double
    mean vs = (sum $ fromIntegral <$> vs) / fromIntegral n
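For reference, a minimal driver for the task might look like the sketch below (the seed 42 is taken from the earlier example; the actual benchmark harness in the repository presumably does more):
main :: IO ()
main = do
    lfsr <- newLFSR
    setLFSR lfsr 42
    avg lfsr 10000000 >>= print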
2. Using the streaming library
This gets us to within 9x of the baseline,
=== RunAvgStreaming ===
Baseline: 1.9391e-2
IO: 0.168126
factor: 8.670310969006241
(Note that the benchmarks are rather inaccurate at these short execution times.)
This is the implementation (RunAvgStreaming.hs):
import qualified Streaming.Prelude as S
import Streaming (Of(..))
avg :: LFSR -> Int -> IO Double
avg lfsr n = do
    let stream = S.replicateM n (fromIntegral <$> step1 lfsr :: IO Double)
    (mySum :> _) <- S.sum stream
    return (mySum / fromIntegral n)
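For what it's worth, the same computation can be written with the strict left fold from Streaming.Prelude, which avoids pattern-matching on :> (avg' is a hypothetical name, not from the post):
avg' :: LFSR -> Int -> IO Double
avg' lfsr n = S.fold_ (+) 0 (/ fromIntegral n) $
    S.replicateM n (fromIntegral <$> step1 lfsr :: IO Double)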
3. Using Data.Vector.Fusion.Stream.Monadic
This gives the best performance so far, within 3x of baseline,
=== RunVector =========
Baseline: 1.9986e-2
IO: 4.9146e-2
factor: 2.4590213149204443
As usual, here is the implementation (RunAvgVector.hs):
import qualified Data.Vector.Fusion.Stream.Monadic as V
avg :: LFSR -> Int -> IO Double
avg lfsr n = do
    let stream = V.replicateM n (step1' lfsr)
    s <- V.foldl (+) 0.0 stream
    return (s / fromIntegral n)
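step1' is not shown in the post; presumably it is just step1 with the conversion to Double folded in, something like:
step1' :: LFSR -> IO Double
step1' lfsr = fromIntegral <$> step1 lfsr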
I didn't expect to find a good monadic stream implementation under Data.Vector. Other than providing fromVector and concatVectors, Data.Vector.Fusion.Stream.Monadic has very little to do with Vector from Data.Vector.
A look at the profiling report suggests that Data.Vector.Fusion.Stream.Monadic has a considerable space leak, but that doesn't sound right.
4. Lists aren't necessarily slow
For very simple operations lists aren't terrible at all:
=== RunRepeat =======
Baseline: 1.8078e-2
IO: 3.6253e-2
factor: 2.0053656377917912
Here, the for loop is done in Haskell instead of pushing it down to C (RunRepeat.hs):
do
    setLFSR lfsr 42
    replicateM_ nIter (stepLFSR lfsr)
    getLFSR lfsr
This is just a repetition of calls to stepLFSR without passing the result back to the Haskell layer. It gives an indication of what impact the overhead of calling the wrapper and the FFI has.
Analysis
The repeat example above shows that most, though perhaps not all, of the performance penalty comes from the overhead of calling the wrapper and/or the FFI. I'm not sure where to look for further tweaks; maybe this is as good as it gets for monadic streams, and the remaining work is really about trimming down the FFI.
Sidenotes
- The fact that LFSRs are chosen as a toy problem does not imply that Haskell isn't able to do these efficiently - see the SO question "Efficient bit-fiddling in a LFSR implementation".
- Iterating a 16-bit LFSR 10M times is a rather silly thing to do. It will take at most 2^16-1 iterations to reach the starting state again. In a maximum length LFSR it will take exactly 2^16-1 iterations.
Update 1
An attempt to remove the withForeignPtr calls can be made by introducing a Storable instance and then using alloca :: Storable a => (Ptr a -> IO b) -> IO b:
repeatSteps :: Word32 -> Int -> IO Word32
repeatSteps start n = alloca rep where
    rep :: Ptr LFSRStruct' -> IO Word32
    rep p = do
        setLFSR2 p start
        (sequence_ . replicate n) (stepLFSR2 p)
        getLFSR2 p
where LFSRStruct' is
data LFSRStruct' = LFSRStruct' CUInt
and the wrapper is
foreign import ccall unsafe "lfsr.h set_lfsr"
    setLFSR2 :: Ptr LFSRStruct' -> Word32 -> IO ()
-- likewise for getLFSR2, stepLFSR2, ...
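The Storable instance itself isn't shown; a minimal sketch, assuming the struct really holds nothing but a single unsigned int, might look like this:
import Foreign.Storable (Storable(..))
import Foreign.Ptr (castPtr)
import Foreign.C.Types (CUInt)
instance Storable LFSRStruct' where
    sizeOf _    = sizeOf    (undefined :: CUInt)
    alignment _ = alignment (undefined :: CUInt)
    peek p      = LFSRStruct' <$> peek (castPtr p)
    poke p (LFSRStruct' w) = poke (castPtr p) w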
See RunRepeatAlloca.hs and src/LFSR.hs. Performance-wise this makes no difference (within timing variance).
=== RunRepeatAlloca =======
Baseline: 0.19811199999999998
IO: 0.33433
factor: 1.6875807623970283
Comments
- "[...] void f(...) functions? I.e. repackaging the getters?" – Readership
- "[...] void nothing() {} it might be visible. OTOH calling an external procedure just for a side effect (not passing or receiving data) would make little sense even if stream fusion is achieved, so my point may be moot." – Townshend