Reduce list on the fly in Haskell

R

3

8

Assume I have a function f which accepts some input and produce a number. Within the function f, a list is created according the input, which is then reduced (e.g. using foldl' g) to produce the final output number. Because the intermediate list is to be reduced after all, is it possible to apply the reduce function g without expressing the intermediate list. The goal here is to limit memory used for storing (or expressing, if 'storing' a less accurate word) the list.

To illustrate it, this function foldPairProduct takes O(N1 * N2) space for the intermediate list (the space consumed maybe more complicated due to expression and lazy evaluation, but I assume it is proportional or worse). Here N1, N2 are the size of the two input lists.

foldPairProduct :: (Num a, Ord a)  => (a -> a -> a) -> [a] -> [a] -> a
foldPairProduct f xs ys = foldl1 f [ x*y | x <- xs, y <- ys]

An alternative implementation to the logic is foldPairProduct', which takes O(2 * 2) space.

foldPairProduct' :: Num a => (Maybe a -> Maybe a -> Maybe a) -> [a] -> [a] -> Maybe a  
foldPairProduct' _ _ [] = Nothing
foldPairProduct' _ [] _ = Nothing
foldPairProduct' f (x:xs) (y:ys) = 
  foldl1 f [Just $ x*y, foldPairProduct' f [x] ys, foldPairProduct' f xs [y], 
            foldPairProduct' f xs ys]

The situation is exacerbated for foldCrossProduct whose implementation is similar to foldPairProduct except that it accepts multiple lists as input. Its space complexity (still in imperative languages' sense) for the intermediate list is O(N1 * N2 * ...* Nk), where k being the length of [[a]].

foldCrossProduct :: Num a => (a -> a -> a) -> [[a]]  -> a
foldCrossProduct f xss = foldl1 f (crossProduct xss)

crossProduct :: Num a => [[a]] -> [a]
crossProduct [] = []
crossProduct (xs:[]) = xs
crossProduct (xs:xss) = [x * y | x <- xs, y <- crossProduct xss]

If we had followed the implementation idea of foldPairProduct', the space complexity would be k^2, which is much more space efficient. My questions are:

I implemented foldPairProduct' for a pair of list. However, seems like implementing it for arbitrary number of list is not straightforward.
I don't mean to compare Haskell to imperative language, but is there an implementation that would use constant space (or in another word, without expressing the intermediate list of the mentioned length)? Maybe Monad would help but I very new to it.
Does the compiler actually do its magic? That is, it notice the list is intermediate and to be reduced, and indeed figure out a way to evaluate it space-efficiently. After all, that is what I thought the lazy evaluation and compiler optimization is designed for.
Any comment is welcomed. Thanks.

Update 1

A performance test confirmed the analysis on 'space complexity' of foldPairProduct and foldCrossProduct, based on varying the input size N1, N2, N3, and on observing the Bytes Copied by GC.

The performance test dis-proof the analysis on foldPairProduct' which surprisingly showed N1 * N2 or even worse space usage. This is probably due to the recursive call is inefficiently evaluated. The results is attached below (with ghc settings same as what Yuras did).

Update 2

Updated some further experiment as I learn from the comments and answers. For foldPairProduct, the total memory in use is consistent with the space complexity Daniel Fischer's explained.

~~For foldCrossProduct, though Daniel's complexity analysis make sense to me, the results does not show a linear memory usage.~~ Following Daniel's advice, swapped the x <- xs and y <- crossproduct ys, and it indeed achieve a linear space complexity.

For foldCrossProduct (max) [[1..100],[1..n], [1..1000]], with n = 100, 1000, 10000, 100000, the memory used are 2, 2, 3, 14 MB.

foldPairProduct [1..n] [1..10000]

n = 100
  120,883,320 bytes allocated in the heap 
   56,867,728 bytes copied during GC
      428,384 bytes maximum residency (50 sample(s)) 
       98,664 bytes maximum slop
            3 MB total memory in use (0 MB lost due to fragmentation)     

n = 1000
 1,200,999,280 bytes allocated in the heap 
   569,837,360 bytes copied during GC   
       428,384 bytes maximum residency (500 sample(s))
        99,744 bytes maximum slop 
             3 MB total memory in use (0 MB lost due to fragmentation) 
n = 10000

  12,002,152,040 bytes allocated in the heap
   5,699,468,024 bytes copied during GC 
         428,384 bytes maximum residency (5000 sample(s))
          99,928 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation)

n = 100000

 120,013,672,800 bytes allocated in the heap 
  56,997,625,608 bytes copied during GC 
         428,384 bytes maximum residency (50000 sample(s)) 
          99,984 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation)

foldPairProduct [1..10000] [1..n]

n = 100

     121,438,536 bytes allocated in the heap 
          55,920 bytes copied during GC     
          32,408 bytes maximum residency (1 sample(s)) 
          19,856 bytes maximum slop  
               1 MB total memory in use (0 MB lost due to fragmentation)

n = 1000

   1,201,511,296 bytes allocated in the heap 
         491,864 bytes copied during GC     
          68,392 bytes maximum residency (1 sample(s)) 
          20,696 bytes maximum slop                   
               1 MB total memory in use (0 MB lost due to fragmentation)

n = 10000

  12,002,232,056 bytes allocated in the heap 
   5,712,004,584 bytes copied during GC     
         428,408 bytes maximum residency (5000 sample(s)) 
          98,688 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation)

n = 100000

 120,009,432,816 bytes allocated in the heap
  81,694,557,064 bytes copied during GC 
       4,028,408 bytes maximum residency (10002 sample(s))
         769,720 bytes maximum slop 
              14 MB total memory in use (0 MB lost due to fragmentation)

foldPairProduct [1..n] [1..n]

n = 100
 1,284,024 bytes allocated in the heap
    15,440 bytes copied during GC
    32,336 bytes maximum residency (1 sample(s))
    19,920 bytes maximum slop                  
         1 MB total memory in use (0 MB lost due to fragmentation)  

n = 1000
 120,207,224 bytes allocated in the heap  
     114,848 bytes copied during GC 
      68,336 bytes maximum residency (1 sample(s)) 
      24,832 bytes maximum slop 
           1 MB total memory in use (0 MB lost due to fragmentation)  

n = 10000

  12,001,432,024 bytes allocated in the heap 
   5,708,472,592 bytes copied during GC 
         428,336 bytes maximum residency (5000 sample(s)) 
          99,960 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation) 

n = 100000
 1,200,013,672,824 bytes allocated in the heap 
   816,574,713,664 bytes copied during GC 
         4,028,336 bytes maximum residency (100002 sample(s)) 
           770,264 bytes maximum slop 
                14 MB total memory in use (0 MB lost due to fragmentation)

foldCrossProduct (max) [[1..n], [1..100], [1..1000]]

n = 100
     105,131,320 bytes allocated in the heap 
      38,697,432 bytes copied during GC     
         427,832 bytes maximum residency (34 sample(s)) 
         209,312 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation)

n = 1000
   1,041,254,480 bytes allocated in the heap 
     374,148,224 bytes copied during GC 
         427,832 bytes maximum residency (334 sample(s))
         211,936 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation)

n = 10000
  10,402,479,240 bytes allocated in the heap 
   3,728,429,728 bytes copied during GC     
         427,832 bytes maximum residency (3334 sample(s))
         215,936 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

foldCrossProduct (max) [[1..100], [1..n], [1..1000]]

n = 100
     105,131,344 bytes allocated in the heap 
      38,686,648 bytes copied during GC  
         431,408 bytes maximum residency (34 sample(s)) 
         205,456 bytes maximum slop 
               3 MB total memory in use (0 MB lost due to fragmentation)

n = 1000
   1,050,614,504 bytes allocated in the heap
     412,084,688 bytes copied during GC 
       4,031,456 bytes maximum residency (53 sample(s)) 
       1,403,976 bytes maximum slop
              15 MB total memory in use (0 MB lost due to fragmentation)    
n = 10000
    quit after over 1362 MB total memory in use (0 MB lost due to fragmentation)

foldPairProduct' [1..n] [1..n]

n = 100
 4,351,176 bytes allocated in the heap
    59,432 bytes copied during GC    
    74,296 bytes maximum residency (1 sample(s))
    21,320 bytes maximum slop                  
         1 MB total memory in use (0 MB lost due to fragmentation)

n = 1000
 527,009,960 bytes allocated in the heap 
  45,827,176 bytes copied during GC 
     211,680 bytes maximum residency (1 sample(s)) 
      25,760 bytes maximum slop 
           2 MB total memory in use (0 MB lost due to fragmentation)

Revamp answered 20/2, 2013 at 23:41 Comment(6)

I believe the purpose of haskell's laziness is that even when something is expressed as a list, the list isn't necessarily stored if it's not needed. – Vale 21/2, 2013 at 0:19

Did you actually test space consumption? – Brittaney 21/2, 2013 at 0:28

"observing the Bytes Copied by GC" Why so much attention a to number of bytes copied during GC? Bytes maximum residency or total memory in use should be more accurate for space usage measurements. Your program operates with immutable data, so a lot of allocations are ok. It may be an issue that GC copies a lot, but it is not related to memory usage. – Simonson 21/2, 2013 at 9:32

I found this paper about fusion for list hylomorphisms, that might be of theoretical interest. – Gutenberg 21/2, 2013 at 11:17

Copied bytes isn't that interesting. Turn on more statistics and check if the maximum heap in use grows. That's the number to worry about. – Brittaney 21/2, 2013 at 19:39

@phg the paper that you mentioned is right on the problem. – Revamp 25/2, 2013 at 6:22

V

2

foldPairProduct :: (Num a, Ord a)  => (a -> a -> a) -> [a] -> [a] -> a
foldPairProduct f xs ys = foldl1 f [ x*y | x <- xs, y <- ys]

can be a good memory citizen. The second argument, ys is used repeatedly, so that will have to be in memory entirely during the computation, but the intermediate list is lazily produced as it is consumed, so that contributes only a constant amount of memory, giving an overall O(length ys) space complexity. Of course there have to be length xs * length ys list cells produces and consumed, so the overall allocations are O(length xs * length ys) [assuming each a value uses bounded space]. The number of bytes copied during GC (and thus the time needed for GC) can be drastically reduced by providing a larger allocation area, with +RTS -A1M, the number drops from

3,717,333,376 bytes copied during GC

for the default setting to

20,445,728 bytes copied during GC

and the time from GC time 4.88s to GC time 0.07s for xs == ys = [1 .. 10000] :: [Int] and f = (+).

But that depends on the strictness analyser doing its job - which it does fine if the type it's used at is e.g. Int and known during compilation, and the combining function is known to be strict. If the code is not specialised or if the combining function is not known to be strict, the fold will produce a thunk of O(length xs * length ys) size. That problem can be alleviated by using the stricter foldl1'.

foldPairProduct' :: Num a => (Maybe a -> Maybe a -> Maybe a) -> [a] -> [a] -> Maybe a  
foldPairProduct' _ _ [] = Nothing
foldPairProduct' _ [] _ = Nothing
foldPairProduct' f (x:xs) (y:ys) = 
  foldl1 f [Just $ x*y, foldPairProduct' f [x] ys, foldPairProduct' f xs [y], 
            foldPairProduct' f xs ys]

runs head-on into the problem of insufficient strictness, the value wrapped by a Just constructor can't be made strict by the compiler here, since it might not be needed for the overall result, so the fold often produces an O(length xs * length ys) size thunk under the Just - of course, for some f, like const, it will behave well as is. For that to be a good memory citizen if all values are used, you must use a sufficiently strict combining function f, forcing also the value under Just in the result (if it's a Just); using foldl1' also helps. With that, it can have O(length ys + length xs) space complexity (the lists xs and ys are used more than once, so are reused).

foldCrossProduct :: Num a => (a -> a -> a) -> [[a]]  -> a
foldCrossProduct f xss = foldl1 f (crossProduct xss)

crossProduct :: Num a => [[a]] -> [a]
crossProduct [] = []
crossProduct (xs:[]) = xs
crossProduct (xs:xss) = [x * y | x <- xs, y <- crossProduct xss]

Although GHC does little CSE (common subexpression elimination), the list crossProduct xss will (probably) be shared here between the different xs, so that produces O(N2*...*Nk) space complexity. If the order of elements in the list doesn't matter, reordering to

crossProduct (xs:xss) = [x * y | y <- crossProduct xss, x <- xs]

helps. Then crossProduct xss need not be in memory at once, so can be incrementally produced and consumed, only xs must be remembered because it's used multiple times. For the recursive invocations, the first of the remaining lists has to be shared, so that would produce an overall O(N1+...+Nk-1) space complexity.

Vivienne answered 21/2, 2013 at 12:44 Comment(6)

I now start understanding and appreciating lazy evaluation. Its combination with Strictness is powerful: high level programming with performance of lower level languages. Is the GHC's behavior on this code called "stream/loop fusion" and "deforestation", as mentioned in the other answers? – Revamp 25/2, 2013 at 0:7

I am wondering about the rules to allow for sufficient strictness. Because, often I would use customized type and functions. e.g. in the foldCrossProduct, f is customized max function, so that it may return both the max value and a vector of indices corresponding the max. – Revamp 25/2, 2013 at 0:11

No, that's still ordinary lazy evaluation. Stream fusion would then eliminate the allocation of list cells for the intermediate list. The ordinary lazy evaluation goes produce a cell, consume it, produce next, consume, ... and fusion would have the consumer get the list element directly instead of getting the cell that contains a pointer to the element. The rules for the appropriate strictness are rather difficult, since the correct strictness varies greatly with the situation. Too much strictness - or strictness in the wrong place - is as bad for speed and memory consumption as too much ... – Vivienne 25/2, 2013 at 11:45

laziness (or laziness in the wrong place). As a rule of thumb, you want laziness if the result can be incrementally produced, and strictness if the result can only be returned once it's complete. There are, however, exceptions. For your example, you certainly want f strict in the maximum value, for the indices, it's not so clear cut, but I see more possible traps if they are treated lazily. – Vivienne 25/2, 2013 at 11:47

Lazy evaluation, a double-edged sword? – Revamp 26/2, 2013 at 0:8

At least. It takes a bit of experience until you can juggle it without cutting yourself too often. – Vivienne 26/2, 2013 at 0:13

S

4

(Ok, I was wrong, it will not work in constant space because one of the lists is used multiple times, so it most likely to have linear space complexity)

Did you try to compile test program with optimizations enabled? Your foldPairProduct looks good for me, and I expect it to work in constant space.

ADD: Yes, it works in constant space (3 MB total memory in use):

shum@shum-laptop:/tmp/shum$ cat test.hs 

foldPairProduct f xs ys = foldl1 f [ x*y | x <- xs, y <- ys]

n :: Int
n = 10000

main = print $ foldPairProduct (+) [1..n] [1..n]
shum@shum-laptop:/tmp/shum$ ghc --make -fforce-recomp -O test.hs 
[1 of 1] Compiling Main             ( test.hs, test.o )
Linking test ...
shum@shum-laptop:/tmp/shum$ time ./test +RTS -s
2500500025000000
  10,401,332,232 bytes allocated in the heap
   3,717,333,376 bytes copied during GC
         428,280 bytes maximum residency (3335 sample(s))
         219,792 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     16699 colls,     0 par    4.27s    4.40s     0.0003s    0.0009s
  Gen  1      3335 colls,     0 par    1.52s    1.52s     0.0005s    0.0012s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.23s  (  2.17s elapsed)
  GC      time    5.79s  (  5.91s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    8.02s  (  8.08s elapsed)

  %GC     time      72.2%  (73.2% elapsed)

  Alloc rate    4,659,775,665 bytes per MUT second

  Productivity  27.8% of total user, 27.6% of total elapsed


real    0m8.085s
user    0m8.025s
sys 0m0.040s
shum@shum-laptop:/tmp/shum$

Simonson answered 21/2, 2013 at 0:13 Comment(3)

Specifically, it uses the second argument multiple times. For n = 10000000, foldPairProduct (+) [1..10] [1..n] uses 1362MB of memory, whereas foldPairProduct (+) [1..n] [1..10] uses 1MB of memory. – Zingaro 21/2, 2013 at 1:0

@Zingaro I wonder whether it is possible to completely prevent sharing – Simonson 21/2, 2013 at 1:24

I don't know of an easy way to do this. This question on thunk duplication might be of interest. That said, doing so would involve recomputing the list every time, so you are making a potentially large time vs space trade off. – Zingaro 21/2, 2013 at 1:31

S

4

There is specific optimization for creation/modifciation/consumption of lists called loop fusion. Because Haskell is pure and non-strict there is number of laws like map f . mag g == map (f . g).

If the compiler for some reason would not recognize the code and produce sub-optimal code (after passing -O flag) I would look into stream fusion in detail to see what is preventing it.

Sanchez answered 21/2, 2013 at 0:35 Comment(0)

V

2

foldPairProduct :: (Num a, Ord a)  => (a -> a -> a) -> [a] -> [a] -> a
foldPairProduct f xs ys = foldl1 f [ x*y | x <- xs, y <- ys]

can be a good memory citizen. The second argument, ys is used repeatedly, so that will have to be in memory entirely during the computation, but the intermediate list is lazily produced as it is consumed, so that contributes only a constant amount of memory, giving an overall O(length ys) space complexity. Of course there have to be length xs * length ys list cells produces and consumed, so the overall allocations are O(length xs * length ys) [assuming each a value uses bounded space]. The number of bytes copied during GC (and thus the time needed for GC) can be drastically reduced by providing a larger allocation area, with +RTS -A1M, the number drops from

3,717,333,376 bytes copied during GC

for the default setting to

20,445,728 bytes copied during GC

and the time from GC time 4.88s to GC time 0.07s for xs == ys = [1 .. 10000] :: [Int] and f = (+).

But that depends on the strictness analyser doing its job - which it does fine if the type it's used at is e.g. Int and known during compilation, and the combining function is known to be strict. If the code is not specialised or if the combining function is not known to be strict, the fold will produce a thunk of O(length xs * length ys) size. That problem can be alleviated by using the stricter foldl1'.

foldPairProduct' :: Num a => (Maybe a -> Maybe a -> Maybe a) -> [a] -> [a] -> Maybe a  
foldPairProduct' _ _ [] = Nothing
foldPairProduct' _ [] _ = Nothing
foldPairProduct' f (x:xs) (y:ys) = 
  foldl1 f [Just $ x*y, foldPairProduct' f [x] ys, foldPairProduct' f xs [y], 
            foldPairProduct' f xs ys]

runs head-on into the problem of insufficient strictness, the value wrapped by a Just constructor can't be made strict by the compiler here, since it might not be needed for the overall result, so the fold often produces an O(length xs * length ys) size thunk under the Just - of course, for some f, like const, it will behave well as is. For that to be a good memory citizen if all values are used, you must use a sufficiently strict combining function f, forcing also the value under Just in the result (if it's a Just); using foldl1' also helps. With that, it can have O(length ys + length xs) space complexity (the lists xs and ys are used more than once, so are reused).

foldCrossProduct :: Num a => (a -> a -> a) -> [[a]]  -> a
foldCrossProduct f xss = foldl1 f (crossProduct xss)

crossProduct :: Num a => [[a]] -> [a]
crossProduct [] = []
crossProduct (xs:[]) = xs
crossProduct (xs:xss) = [x * y | x <- xs, y <- crossProduct xss]

Although GHC does little CSE (common subexpression elimination), the list crossProduct xss will (probably) be shared here between the different xs, so that produces O(N2*...*Nk) space complexity. If the order of elements in the list doesn't matter, reordering to

crossProduct (xs:xss) = [x * y | y <- crossProduct xss, x <- xs]

helps. Then crossProduct xss need not be in memory at once, so can be incrementally produced and consumed, only xs must be remembered because it's used multiple times. For the recursive invocations, the first of the remaining lists has to be shared, so that would produce an overall O(N1+...+Nk-1) space complexity.

Vivienne answered 21/2, 2013 at 12:44 Comment(6)

I now start understanding and appreciating lazy evaluation. Its combination with Strictness is powerful: high level programming with performance of lower level languages. Is the GHC's behavior on this code called "stream/loop fusion" and "deforestation", as mentioned in the other answers? – Revamp 25/2, 2013 at 0:7

I am wondering about the rules to allow for sufficient strictness. Because, often I would use customized type and functions. e.g. in the foldCrossProduct, f is customized max function, so that it may return both the max value and a vector of indices corresponding the max. – Revamp 25/2, 2013 at 0:11

No, that's still ordinary lazy evaluation. Stream fusion would then eliminate the allocation of list cells for the intermediate list. The ordinary lazy evaluation goes produce a cell, consume it, produce next, consume, ... and fusion would have the consumer get the list element directly instead of getting the cell that contains a pointer to the element. The rules for the appropriate strictness are rather difficult, since the correct strictness varies greatly with the situation. Too much strictness - or strictness in the wrong place - is as bad for speed and memory consumption as too much ... – Vivienne 25/2, 2013 at 11:45

laziness (or laziness in the wrong place). As a rule of thumb, you want laziness if the result can be incrementally produced, and strictness if the result can only be returned once it's complete. There are, however, exceptions. For your example, you certainly want f strict in the maximum value, for the indices, it's not so clear cut, but I see more possible traps if they are treated lazily. – Vivienne 25/2, 2013 at 11:47

Lazy evaluation, a double-edged sword? – Revamp 26/2, 2013 at 0:8

At least. It takes a bit of experience until you can juggle it without cutting yourself too often. – Vivienne 26/2, 2013 at 0:13

Recommended topics

Hot tags