Profiling a Haskell program

I have a piece of code that repeatedly samples from a probability distribution using sequence. Morally, it does something like this:

sampleMean :: MonadRandom m => Int -> m Float -> m Float
sampleMean n dist = do
  xs <- sequence (replicate n dist)
  return (sum xs)

Except that it's a bit more complicated. The actual code I'm interested in is the function likelihoodWeighting at this Github repo.

I noticed that the running time scales nonlinearly with n. In particular, once n exceeds a certain value it hits the memory limit, and the running time explodes. I'm not certain, but I think this is because sequence is building up a long list of thunks which aren't getting evaluated until the call to sum.

Once I get past about 100,000 samples, the program slows to a crawl. I'd like to optimize this (my feeling is that 10 million samples shouldn't be a problem) so I decided to profile it - but I'm having a little trouble understanding the output of the profiler.


Profiling

I created a short executable in a file main.hs that runs my function with 100,000 samples. Here's the output from doing

$ ghc -O2 -rtsopts main.hs
$ ./main +RTS -s

First things I notice - it allocates nearly 1.5 GB of heap, and spends 60% of its time on garbage collection. Is this generally indicative of too much laziness?

 1,377,538,232 bytes allocated in the heap
 1,195,050,032 bytes copied during GC
   169,411,368 bytes maximum residency (12 sample(s))
     7,360,232 bytes maximum slop
           423 MB total memory in use (0 MB lost due to fragmentation)

Generation 0:  2574 collections,     0 parallel,  2.40s,  2.43s elapsed
Generation 1:    12 collections,     0 parallel,  1.07s,  1.28s elapsed

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    1.92s  (  1.94s elapsed)
GC    time    3.47s  (  3.70s elapsed)
RP    time    0.00s  (  0.00s elapsed)
PROF  time    0.23s  (  0.23s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    5.63s  (  5.87s elapsed)

%GC time      61.8%  (63.1% elapsed)

Alloc rate    716,368,278 bytes per MUT second

Productivity  34.2% of total user, 32.7% of total elapsed

Here are the results from

$ ./main +RTS -p

The first time I ran this, it turned out that one function was being called repeatedly, and that I could memoize it, which sped things up by a factor of 2. It didn't solve the space leak, however.

COST CENTRE           MODULE                no. entries  %time %alloc   %time %alloc

MAIN                  MAIN                    1        0   0.0    0.0   100.0  100.0
 main                 Main                  434        4   0.0    0.0   100.0  100.0
  likelihoodWeighting AI.Probability.Bayes  445        1   0.0    0.3   100.0  100.0
   distributionLW     AI.Probability.Bayes  448        1   0.0    2.6     0.0    2.6
   getSampleLW        AI.Probability.Bayes  446   100000  20.0   50.4   100.0   97.1
    bnProb            AI.Probability.Bayes  458   400000   0.0    0.0     0.0    0.0
    bnCond            AI.Probability.Bayes  457   400000   6.7    0.8     6.7    0.8
    bnVals            AI.Probability.Bayes  455   400000  20.0    6.3    26.7    7.1
     bnParents        AI.Probability.Bayes  456   400000   6.7    0.8     6.7    0.8
    bnSubRef          AI.Probability.Bayes  454   800000  13.3   13.5    13.3   13.5
    weightedSample    AI.Probability.Bayes  447   100000  26.7   23.9    33.3   25.3
     bnProb           AI.Probability.Bayes  453   100000   0.0    0.0     0.0    0.0
     bnCond           AI.Probability.Bayes  452   100000   0.0    0.2     0.0    0.2
     bnVals           AI.Probability.Bayes  450   100000   0.0    0.3     6.7    0.5
      bnParents       AI.Probability.Bayes  451   100000   6.7    0.2     6.7    0.2
     bnSubRef         AI.Probability.Bayes  449   200000   0.0    0.7     0.0    0.7

Here's a heap profile. I don't know why it claims the runtime is 1.8 seconds - this run took about 6 seconds.

[heap profile image not reproduced here]
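
For reference, a heap graph like this is normally produced with something along these lines (my exact invocation may have differed slightly):

$ ghc -O2 -rtsopts -prof -auto-all main.hs
$ ./main +RTS -hc -p
$ hp2ps -c main.hp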

Can anyone help me to interpret the output of the profiler - i.e. to identify where the bottleneck is, and provide suggestions for how to speed things up?

Austria answered 21/7, 2012 at 17:42 Comment(14)
Try using replicateM n instead of sequence . replicate nEakins
No change in the running time - possibly not that surprising, since replicateM n is defined as sequence . replicate n.Austria
length xs is always n, isn't it? So replace (length xs) with n and then xs doesn't have to stay entirely resident in memory at any time.Neurophysiology
@Neurophysiology that might also trigger list fusion optimizations, since sum is a list consumer and replicateM is a list generator. This would eliminate the need for extra strictness as well.Eakins
@Neurophysiology Good point, I'll edit the question. In my 'real' problem I'm estimating a distribution rather than just the mean - but I didn't want to clog up the question with unnecessary details...Austria
@Neurophysiology In case you're interested, the actual code is the function likelihoodWeighting in this Github repoAustria
Maybe I'm doing something wrong, but when I run your code, the list xs contains all the same number. I'm doing this: main = do { g <- getStdGen; a <- sampleMean 10000 (return $ fst $ randomR (0.0, 1.0) g); print a }Affected
@PeterHall you're passing in the same generator to each call to randomR, so it generates the same 'random' number each time. You could try something like (I've not tested this) main = do { a <- sampleMean 10000 (randomRIO (0.0,1.0)); print a }Austria
thanks Chris, I was being stupid! :)Affected
The profile is completely unhelpful, unfortunately. Did you compile with -prof -auto-all? If not, that may produce something more helpful, if yes, more manually inserted {-# SCC #-}s would be needed. All in all, it looks like your code is being too lazy indeed. If you could put up your testing main somewhere, I could take a look tomorrow.Raymonraymond
Thanks Daniel. I'm about to add some more detailed profiling info - I've made some efficiency improvements by memoizing some things that weren't being memoized but it's still overly lazy. Would appreciate any help you can give!Austria
@DanielFischer The code I've been profiling is here.Austria
Morally? Maybe a defined term in Haskell? Sorry. Ignorant gratuitous comment...Bloomer
It means "It's not exactly like this, it's a bit more complicated, but I'm simplifying it because the extra complication isn't relevant to the question I want to ask."Austria

A huge improvement has already been achieved by incorporating JohnL's suggestion of using foldM in likelihoodWeighting. That reduced memory usage about tenfold here, and brought the GC times down to nearly (or actually) negligible levels.

A profiling run with the current source yields

probabilityIO              AI.Util.Util          26.1   42.4    413 290400000
weightedSample.go          AI.Probability.Bayes  16.1   19.1    255 131200080
bnParents                  AI.Probability.Bayes  10.8    1.2    171   8000384
bnVals                     AI.Probability.Bayes  10.4    7.8    164  53603072
bnCond                     AI.Probability.Bayes   7.9    1.2    125   8000384
ndSubRef                   AI.Util.Array          4.8    9.2     76  63204112
bnSubRef                   AI.Probability.Bayes   4.7    8.1     75  55203072
likelihoodWeighting.func   AI.Probability.Bayes   3.3    2.8     53  19195128
%!                         AI.Util.Util           3.3    0.5     53   3200000
bnProb                     AI.Probability.Bayes   2.5    0.0     40        16
bnProb.p                   AI.Probability.Bayes   2.5    3.5     40  24001152
likelihoodWeighting        AI.Probability.Bayes   2.5    2.9     39  20000264
likelihoodWeighting.func.x AI.Probability.Bayes   2.3    0.2     37   1600000

and 13MB memory usage reported by -s, ~5MB maximum residency. That's not too bad already.

Still, there remain some points we can improve. First, a relatively minor thing in the grand scheme of things, AI.Util.Array.ndSubRef:

ndSubRef :: [Int] -> Int
ndSubRef ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])

Reversing the list and mapping (2^) over another list is inefficient; better is

ndSubRef = L.foldl' (\a d -> 2*a + d) 0

which doesn't need to keep the entire list in memory (probably not a big deal, since the lists will be short) as reversing it does, and doesn't need to allocate a second list. The reduction in allocation is noticeable, about 10%, and that part runs measurably faster,

ndSubRef                   AI.Util.Array          1.7    1.3     24   8000384

in the profile of the modified run, but since it takes only a small part of the overall time, the overall impact is small. There are potentially bigger fish to fry in weightedSample and likelihoodWeighting.

Let's add a bit of strictness in weightedSample to see how that changes things:

weightedSample :: Ord e => BayesNet e -> [(e,Bool)] -> IO (Map e Bool, Prob)
weightedSample bn fixed =
    go 1.0 (M.fromList fixed) (bnVars bn)
    where
        go w assignment []     = return (assignment, w)
        go w assignment (v:vs) = if v `elem` vars
            then
                let w' = w * bnProb bn assignment (v, fixed %! v)
                in go w' assignment vs
            else do
                let p = bnProb bn assignment (v,True)
                x <- probabilityIO p
                go w (M.insert v x assignment) vs

        vars = map fst fixed

Neither the weight parameter of go nor the assignment parameter is ever forced, so they can build up thunks. Let's enable {-# LANGUAGE BangPatterns #-} to force updates to take effect immediately, and also evaluate p before passing it to probabilityIO:

go w assignment (v:vs) = if v `elem` vars
    then
        let !w' = w * bnProb bn assignment (v, fixed %! v)
        in go w' assignment vs
    else do
        let !p = bnProb bn assignment (v,True)
        x <- probabilityIO p
        let !assignment' = M.insert v x assignment
        go w assignment' vs

That brings a further reduction in allocation (~9%) and a small speedup (~13%), but the total memory usage and maximum residency haven't changed much.

I see nothing else obvious to change there, so let's look at likelihoodWeighting:

func m _ = do
    (a, w) <- weightedSample bn fixed
    let x = a ! e
    return $! x `seq` w `seq` M.adjust (+w) x m

In the last line: first, w is already evaluated in weightedSample now, so we don't need to seq it here; and the key x is required to evaluate the updated map, so seqing that isn't necessary either. The bad thing on that line is M.adjust. adjust has no way of forcing the result of the update function, so that builds thunks in the map's values. You can force evaluation of the thunks by looking up the modified value and forcing that, but since the key at which the map is updated is guaranteed to be present, Data.Map provides a much more convenient way here: insertWith':

func !m _ = do
    (a, w) <- weightedSample bn fixed
    let x = a ! e
    return (M.insertWith' (+) x w m)

(Note: GHC optimises better with a bang-pattern on m than with return $! ... here). That slightly reduces the total allocation and doesn't measurably change the running time, but has a great impact on total memory used and maximum residency:

 934,566,488 bytes allocated in the heap
   1,441,744 bytes copied during GC
      68,112 bytes maximum residency (1 sample(s))
      23,272 bytes maximum slop
           1 MB total memory in use (0 MB lost due to fragmentation)

The biggest improvement in running time still to be had would come from avoiding randomIO; the StdGen it uses is very slow.
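
For example, one could thread an explicit generator from the mwc-random package through the sampling instead of going through randomIO. This is only a sketch, not code from the repo; the name probabilityMWC and the use of Double are illustrative:

import System.Random.MWC (GenIO, createSystemRandom, uniformR)

-- Hypothetical replacement for probabilityIO: draw from an explicit,
-- mutable MWC generator instead of the global StdGen behind randomIO.
probabilityMWC :: GenIO -> Double -> IO Bool
probabilityMWC gen p = do
    x <- uniformR (0, 1) gen      -- one fast uniform Double
    return (x < p)

-- Usage sketch: create the generator once and pass it down the call chain.
-- main = do gen <- createSystemRandom
--           print =<< probabilityMWC gen 0.3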

I am surprised how much time the bn* functions take, but don't see any obvious inefficiency in those.

Raymonraymond answered 22/7, 2012 at 15:8 Comment(1)
Thanks Daniel, this is really useful. I've made your suggested changes and I'll look into your idea of getting rid of randomIO. I feel like I'm that much closer to writing a 'real' Haskell program now...Austria

I have trouble digesting these profiles, but I have gotten my ass kicked before because the MonadRandom on Hackage is strict. Creating a lazy version of MonadRandom made my memory problems go away.

My colleague has not yet gotten permission to release the code, but I've put Control.Monad.LazyRandom online at pastebin. Or if you want to see some excerpts that explain a fully lazy random search, including infinite lists of random computations, check out Experience Report: Haskell in Computational Biology.
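
The core idea, roughly, is to split the generator in (>>=), much as QuickCheck's Gen does, so that a later computation never needs to force an earlier one to get at its generator. This is only a sketch of that idea, not our actual Control.Monad.LazyRandom:

import System.Random (StdGen, Random, random, split)

-- Sketch only: each computation is a pure function of a generator, and
-- (>>=) splits the generator rather than threading it, so results can be
-- consumed lazily, including infinite lists of random values.
newtype LazyRand a = LazyRand { runLazyRand :: StdGen -> a }

instance Functor LazyRand where
  fmap f (LazyRand m) = LazyRand (f . m)

instance Applicative LazyRand where
  pure x = LazyRand (const x)
  LazyRand mf <*> LazyRand mx =
    LazyRand $ \g -> let (g1, g2) = split g in mf g1 (mx g2)

instance Monad LazyRand where
  LazyRand m >>= k =
    LazyRand $ \g -> let (g1, g2) = split g in runLazyRand (k (m g1)) g2

-- One random value of any Random type.
getLazyRandom :: Random a => LazyRand a
getLazyRandom = LazyRand (fst . random)

With a monad like this, sequence (replicate n getLazyRandom) yields a list whose elements can be consumed and garbage-collected incrementally.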

Fotina answered 22/7, 2012 at 18:42 Comment(3)
+1 for proposing a less strict implementation instead of the regular advice of making it more strictBilk
Thanks, this is an interesting idea. Daniel Fischer suggested I get rid of my calls to randomIO, so I suppose that having my sampling functions return a Rand a and then calling them with evalRandIO is the way to go.Austria
@Chris: I've put our paper online; it has a few examples of computing with Rand a. We call evalRand only once at top level, or when forking a list of parallel computations.Fotina

I put together a very elementary example, posted here: http://hpaste.org/71919. I'm not sure if it's anything like your example; it's just a very minimal thing that seemed to work.

Compiling with -prof and -fprof-auto and running with 100000 iterations yielded the following head of the profiling output:

COST CENTRE                   MODULE               %time %alloc

sample                        AI.Util.ProbDist      31.5   36.6
bnParents                     AI.Probability.Bayes  23.2    0.0
bnRank                        AI.Probability.Bayes  10.7   23.7
weightedSample.go             AI.Probability.Bayes   9.6   13.4
bnVars                        AI.Probability.Bayes   8.6   16.2
likelihoodWeighting           AI.Probability.Bayes   3.8    4.2
likelihoodWeighting.getSample AI.Probability.Bayes   2.1    0.7
sample.cumulative             AI.Util.ProbDist       1.7    2.1
bnCond                        AI.Probability.Bayes   1.6    0.0
bnRank.ps                     AI.Probability.Bayes   1.1    0.0

And here are the summary statistics:

1,433,944,752 bytes allocated in the heap
 1,016,435,800 bytes copied during GC
   176,719,648 bytes maximum residency (11 sample(s))
     1,900,232 bytes maximum slop
           400 MB total memory in use (0 MB lost due to fragmentation)

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    1.40s  (  1.41s elapsed)
GC      time    1.08s  (  1.24s elapsed)
Total   time    2.47s  (  2.65s elapsed)

%GC     time      43.6%  (46.8% elapsed)

Alloc rate    1,026,674,336 bytes per MUT second

Productivity  56.4% of total user, 52.6% of total elapsed

Notice that the profiler pointed its finger at sample. I forced the return in that function by using $!, and here are some summary statistics afterwards:

1,776,908,816 bytes allocated in the heap
  165,232,656 bytes copied during GC
   34,963,136 bytes maximum residency (7 sample(s))
      483,192 bytes maximum slop
           68 MB total memory in use (0 MB lost due to fragmentation)

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    2.42s  (  2.44s elapsed)
GC      time    0.21s  (  0.23s elapsed)
Total   time    2.63s  (  2.68s elapsed)

%GC     time       7.9%  (8.8% elapsed)

Alloc rate    733,248,745 bytes per MUT second

Productivity  92.1% of total user, 90.4% of total elapsed

Much more productive in terms of GC, but not much changed on the time. You might be able to keep iterating in this profile/tweak fashion to target your bottlenecks and eke out some better performance.

Cogon answered 22/7, 2012 at 0:12 Comment(6)
Thanks a lot! Frankly, I'm impressed that you read enough of my terrible code to work out how to put together an example...Austria
Quick question - I'm compiling my test script with -prof -auto-all but I only get the breakdown by cost centre if I add explicit {-# SCC #-} annotations. How did you get them all automatically? I tried compiling with -prof -fprof-auto but GHC doesn't recognize the second flag.Austria
You must be using an older GHC - I'm using 7.4.1, which supports these profiling options. 7.0.2 supports these; it looks like -auto-all ignores nested functions.Cogon
I must be doing something stupid - I'm installing the library with cabal install -p and then compiling my test file with ghc -rtsopts -prof -fprof-auto -fforce-recomp -O2 --make main.hs but I'm still not getting the detailed profile information. Anything obvious I'm doing wrong?Austria
Hmm. It looks to be fine, and that approach indeed spits out the desired output on my end. Not sure what's going on there?Cogon
FYI, I think I might have copied the wrong summary statistics into my answer. Checking it again, forcing the return in sample yields about a 1.5x speedup on both of our test cases, with the same high productivity.Cogon

I think your initial diagnosis is correct, and I've never seen a profiling report that's useful once memory effects kick in.

The problem is that you're traversing the list twice, once for sequence and again for sum. In Haskell, multiple list traversals of large lists are really, really bad for performance. The solution is generally to use some type of fold, such as foldM. Your sampleMean function can be written as

{-# LANGUAGE BangPatterns #-}

import Control.Monad (foldM, liftM)
import Control.Monad.Random (MonadRandom)

sampleMean2 :: MonadRandom m => Int -> m Float -> m Float
sampleMean2 n dist = foldM (\(!a) mb -> liftM (+a) mb) 0 $ replicate n dist

for example, traversing the list only once.

You can do the same sort of thing with likelihoodWeighting as well. In order to prevent thunks, it's important to make sure that the accumulator in your fold function has appropriate strictness.
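
As a sketch of the shape I mean (drawSample here is a hypothetical stand-in for one weighted sample, and Data.Map.Strict makes sure the accumulated weights are forced as they are inserted):

{-# LANGUAGE BangPatterns #-}

import Control.Monad (foldM)
import qualified Data.Map.Strict as M

-- Sketch only: fold the n samples into a map of accumulated weights,
-- so no intermediate list of samples is ever built.
drawSample :: IO (Bool, Double)
drawSample = return (True, 1.0)   -- placeholder for a real weighted sample

estimate :: Int -> IO (M.Map Bool Double)
estimate n = foldM step M.empty [1 .. n]
  where
    step !acc _ = do
      (x, w) <- drawSample
      return (M.insertWith (+) x w acc)   -- strict map: value forced on insert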

Deanndeanna answered 22/7, 2012 at 2:23 Comment(3)
Your example doesn't typecheck. The step function should look more like liftM . (+). It could also be a bit more strict; testing with randomRIO tends to overflow stack with larger numbers, perhaps something like \a mb -> mb >>= \b -> return $! a + b?Bezique
Thanks, I'll try rewriting likelihoodWeighting with a fold.Austria
@Bezique - thanks, fixed. I usually prefer to use bang patterns for adding strictness, but your suggestion is probably identical.Deanndeanna
