Code becomes slower as more boxed arrays are allocated

Edit: It turns out that things generally (not just array/ref operations) slow down the more arrays have been created, so I guess this might just be measuring increased GC times and might not be as strange as I thought. But I'd still like to know (and learn how to find out) what's happening here, and whether there's some way to mitigate this effect in code that creates lots of smallish arrays. Original question follows.


In investigating some weird benchmarking results in a library, I stumbled upon some behavior I don't understand, though it might be really obvious. It seems that the time taken for many operations (creating a new MutableArray, reading or modifying an IORef) increases in proportion to the number of arrays in memory.

Here's the first example:

module Main
    where 

import Control.Monad
import qualified Data.Primitive as P
import Control.Concurrent
import Data.IORef
import Criterion.Main
import Control.Monad.Primitive(PrimState)

main = do 
  let n = 100000
  allTheArrays <- newIORef []
  defaultMain $
    [ bench "array creation" $ do
         newArr <- P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ())
         atomicModifyIORef'  allTheArrays (\l-> (newArr:l,()))
    ]

We're creating a new array and adding it to a stack. As criterion does more samples and the stack grows, array creation takes more time, and this seems to grow linearly and regularly:

[plot: per-sample time for array creation, growing linearly as more arrays are retained]

Even more oddly, IORef reads and writes are affected: we can see atomicModifyIORef' getting faster, presumably as more arrays are GC'd.

main = do
  let n = 1000000
  arrs <- replicateM (n) $ (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
  -- print $ length arrs -- THIS WORKS TO MAKE THINGS FASTER
  arrsRef <- newIORef arrs
  defaultMain $
    [ bench "atomic-mods of IORef" $
        -- nfIO $         -- OR THIS ALSO WORKS
        replicateM 1000 $
          atomicModifyIORef' arrsRef (\(a:as)-> (as,()))
    ]

[plot: per-sample time for atomicModifyIORef', decreasing as arrays are popped off and collected]

Either of the two commented-out lines gets rid of this behavior, but I'm not sure why (maybe after we force the spine of the list, the elements can actually be collected).

Questions

  • What's happening here?
  • Is it expected behavior?
  • Is there a way I can avoid this slowdown?

Edit: I assume this has something to do with GC taking longer, but I'd like to understand more precisely what's happening, especially in the first benchmark.
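
One way to check that assumption directly is to time a forced major GC while varying the number of live arrays (a rough, untested sketch: if the pause grows with the array count, the slowdown above is GC time rather than the array and IORef operations themselves):

import Control.Monad (forM_, replicateM)
import qualified Data.Primitive as P
import Control.Monad.Primitive (PrimState)
import System.CPUTime (getCPUTime)
import System.Mem (performGC)

main :: IO ()
main = forM_ [0, 50000, 100000, 200000 :: Int] $ \n -> do
  arrs <- replicateM n (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
  t0 <- getCPUTime              -- picoseconds
  performGC                     -- force and time a major collection
  t1 <- getCPUTime
  putStrLn $ show n ++ " live arrays: GC took "
          ++ show (fromIntegral (t1 - t0) / (1e9 :: Double)) ++ " ms"
  -- touch the arrays afterwards so they are still live during the timed GC
  mapM_ (\a -> P.readArray a 0) (take 1 arrs)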


Bonus example

Finally, here's a simple test program that can be used to pre-allocate some number of arrays and time a bunch of atomicModifyIORefs. This seems to exhibit the slow IORef behavior.

import Control.Monad
import System.Environment

import qualified Data.Primitive as P
import Control.Concurrent
import Control.Concurrent.Chan
import Control.Concurrent.MVar
import Data.IORef
import Criterion.Main
import Control.Exception(evaluate)
import Control.Monad.Primitive(PrimState)

import qualified Data.Array.IO as IO
import qualified Data.Vector.Mutable as V

import System.CPUTime
import System.Mem(performGC)

main :: IO ()
main = do
  [n] <- fmap (map read) getArgs
  arrs <- replicateM (n) $ (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
  arrsRef <- newIORef arrs

  t0 <- getCPUTimeDouble

  cnt <- newIORef (0::Int)
  replicateM_ 1000000 $
    (atomicModifyIORef' cnt (\n-> (n+1,())) >>= evaluate)

  t1 <- getCPUTimeDouble

  -- make sure these stick around
  readIORef cnt >>= print
  readIORef arrsRef >>= (flip P.readArray 0 . head) >>= print

  putStrLn "The time:"
  print (t1 - t0)

A heap profile with -hy shows mostly MUT_ARR_PTRS_CLEAN, which I don't completely understand.

If you want to reproduce, here is the cabal file I've been using:

name:                small-concurrency-benchmarks
version:             0.1.0.0       
build-type:          Simple
cabal-version:       >=1.10

executable small-concurrency-benchmarks
  main-is:             Main.hs 
  build-depends:       base >=4.6
                     , criterion
                     , primitive

  default-language:    Haskell2010
  ghc-options: -O2  -rtsopts

Edit: Here's another test program that can be used to compare the slowdown from a heap full of arrays against a heap of [Integer] of the same size. It takes some trial and error adjusting n and observing the profiling output to get comparable runs.

-- (in addition to the imports above, this also needs: import Data.List (tails))
main4 :: IO ()
main4 = do
  [n] <- fmap (map read) getArgs
  let ns = [(1::Integer).. n]
  arrsRef <- newIORef ns
  print $ length ns

  t0 <- getCPUTimeDouble
  mapM (evaluate . sum) (tails [1.. 10000])
  t1 <- getCPUTimeDouble

  readIORef arrsRef >>= (print . sum)

  print (t1 - t0)

Interestingly, when I test this I find that a heap full of arrays degrades performance considerably more than a [Integer] heap of the same size. E.g.

         Baseline  20M   200M
Lists:   0.7       1.0    4.4
Arrays:  0.7       2.6   20.4

Conclusions (WIP)

  1. This is most likely due to GC behavior

  2. But mutable boxed arrays seem to lead to more severe slowdowns than list garbage of the same heap size (see above). Setting +RTS -A200M brings the performance of the array-garbage version in line with the list version, supporting the idea that this has to do with GC.

  3. The slowdown is proportional to the number of arrays allocated, not to the total number of cells across them. Here is a set of runs showing, for a test similar to main4, the effect of the number of arrays allocated both on the time taken to allocate them and on the time taken by a completely unrelated "payload". This is for 16777216 total cells (divided amongst however many arrays):

       Array size   Array create time   Time for "payload":
       8            3.164               14.264
       16           1.532               9.008
       32           1.208               6.668
       64           0.644               3.78
       128          0.528               2.052
       256          0.444               3.08
       512          0.336               4.648
       1024         0.356               0.652
    

    And running this same test on 16777216*4 cells shows basically the same payload times as above, only shifted down two rows.

  4. From what I understand about how GHC works, and looking at (3), I think this overhead might simply come from having pointers to all these arrays sticking around in the remembered set, and from whatever overhead that causes for the GC. A sketch of one possible mitigation follows below.
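
If that's right, one possible mitigation when the array payloads don't need to be boxed (a rough, untested sketch, not something measured above) is to use MutableByteArray from Data.Primitive.ByteArray instead of the boxed MutableArray: byte arrays contain no pointers, so the GC should not need to re-scan them on every collection, no matter how many are live.

import Control.Monad (replicateM, replicateM_)
import Control.Monad.Primitive (PrimState)
import qualified Data.Primitive.ByteArray as BA
import Data.IORef

main :: IO ()
main = do
  -- same "heap full of lots of small arrays" shape as before, but unboxed:
  -- 64 slots of 8 bytes each instead of 64 boxed () elements
  arrs <- replicateM 1000000
            (BA.newByteArray (64 * 8) :: IO (BA.MutableByteArray (PrimState IO)))
  arrsRef <- newIORef arrs

  -- the same kind of IORef workload as in the benchmarks above
  cnt <- newIORef (0 :: Int)
  replicateM_ 1000000 $ atomicModifyIORef' cnt (\x -> (x + 1, ()))
  readIORef cnt >>= print

  -- keep the byte arrays alive past the loop
  readIORef arrsRef >>= \as -> print (BA.sizeofMutableByteArray (head as))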

Docker answered 4/5, 2014 at 21:52

You are paying, on every minor GC, overhead that is linear in the number of mutable arrays that remain live and get promoted to the old generation. This is because GHC unconditionally places all mutable arrays on the mutable list and traverses the entire list on every minor GC. See https://ghc.haskell.org/trac/ghc/ticket/7662 for more information, as well as my mailing list response to your question: http://www.haskell.org/pipermail/glasgow-haskell-users/2014-May/024976.html

Wiggs answered 9/5, 2014 at 6:8

I think you're definitely seeing GC effects. I had a related issue in cassava (https://github.com/tibbe/cassava/issues/49#issuecomment-34929984) where the GC time was increasing linearly with increasing heap size.

Try to measure how the GC time and mutator time increase as you hold on to more and more arrays in memory.
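
A rough sketch of that measurement, assuming a GHC new enough to expose GHC.Stats.getRTSStats (the program must be run with +RTS -T for the RTS to collect these statistics):

import Control.Monad (replicateM, replicateM_)
import Control.Monad.Primitive (PrimState)
import qualified Data.Primitive as P
import Data.IORef
import GHC.Stats (RTSStats(..), getRTSStats)

main :: IO ()
main = do
  -- hold on to a large number of small boxed arrays, as in the question
  arrs <- replicateM 1000000 (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
  arrsRef <- newIORef arrs

  s0 <- getRTSStats
  cnt <- newIORef (0 :: Int)
  replicateM_ 1000000 $ atomicModifyIORef' cnt (\x -> (x + 1, ()))
  s1 <- getRTSStats

  -- nanosecond counters; report mutator vs GC CPU time over the loop
  let ms f = fromIntegral (f s1 - f s0) / (1e6 :: Double)
  putStrLn $ "mutator CPU: " ++ show (ms mutator_cpu_ns) ++ " ms"
  putStrLn $ "GC CPU:      " ++ show (ms gc_cpu_ns) ++ " ms"
  putStrLn $ "collections: " ++ show (gcs s1 - gcs s0)

  -- keep the arrays live for the duration of the measured loop
  readIORef arrsRef >>= \as -> P.readArray (head as) 0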

You can reduce GC time by playing with the +RTS options. For example, try setting -A to your L3 cache size.

Taliesin answered 6/5, 2014 at 14:2

Comments:

  • Thanks for the input! I'll be looking into that. One thing I just looked at that was interesting: if I profile a variation that uses [Integer] as garbage to fill the heap, then I find the timing is affected much less by the heap size (see the edit at the bottom). — Docker
  • ...and for me, running with +RTS -A200M makes the difference between the list and array garbage versions disappear (when comparing runs with approximately identical max heap usage). — Docker
  • @Docker Setting -A high enough makes the GC never run, but don't rely on those numbers for real-life apps. A high enough -A pushes the next GC far enough into the future that it falls beyond the runtime of the benchmark. — Taliesin
