Haskell Lazy ByteString + read/write progress function

Asked 12/7, 2011 at 17:51 Answered 7/11, 2011 at 12:41

Solved haskell io progress lazy-evaluation bytestring

I am learing Haskell Lazy IO.

I am looking for an elegant way to copy a large file (8Gb) while printing copy progress to console.

Consider the following simple program that copies a file silently.

module Main where

import System
import qualified Data.ByteString.Lazy as B

main = do [from, to] <- getArgs
          body <- B.readFile from
          B.writeFile to body

Imgine there is a callback function you want to use for reporting:

onReadBytes :: Integer -> IO ()
onReadBytes count = putStrLn $ "Bytes read: " ++ (show count)

QUESTION: how to weave onReadBytes function into Lazy ByteString so it will be called back on successfull read? Or if this design is not good, then what is the Haskell way to do it?

NOTE: the frequency of callback is not important, it can be called every 1024 bytes or every 1 Mb -- not important

ANSWER: Many thanks to camccann for the answer. I suggest to read it entirely.

Bellow is my version of the code based on camccann's code, you may find it useful.

module Main where

import System
import System.IO
import qualified Data.ByteString.Lazy as B

main = do [from, to] <- getArgs
          withFile from ReadMode $ \fromH ->
            withFile to WriteMode $ \toH ->
              copyH fromH toH $ \x -> putStrLn $ "Bytes copied: " ++ show x

copyH :: Handle -> Handle -> (Integer -> IO()) -> IO ()
copyH fromH toH onProgress =
    copy (B.hGet fromH (256 * 1024)) (write toH) B.null onProgress
    where write o x  = do B.hPut o x
                          return . fromIntegral $ B.length x

copy :: (Monad m) => m a -> (a -> m Integer) -> (a -> Bool) -> (Integer -> m()) -> m()
copy = copy_ 0

copy_ :: (Monad m) => Integer -> m a -> (a -> m Integer) -> (a -> Bool) -> (Integer -> m()) -> m()
copy_ count inp outp done onProgress = do x <- inp
                                          unless (done x) $
                                            do n <- outp x
                                               onProgress (n + count)
                                               copy_ (n + count) inp outp done onProgress

Exophthalmos answered 12/7, 2011 at 17:51 Comment(3)

btw, you can equivalently say [from, to] <- getArgs – Blur 12/7, 2011 at 18:5

As a similar question - is there a way to do progress on a pure computation? – Battled 12/7, 2011 at 19:1

@monadic, its a similar question, but the answer is rather different. Basically you have your function return a pair containing both the result you want and the progress reports, with the progress reports "computed" in such a way that they depend on the computation of the result. Then print the progress reports before using the results. If you want more details then I suggest asking a new question. – Quin 12/7, 2011 at 19:52

First, I'd like to note that a fair number of Haskell programmers regard lazy IO in general with some suspicion. It technically violates purity, but in a limited way that (as far as I'm aware) isn't noticeable when running a single program on consistent input^[0]. On the other hand, plenty of people are fine with it, again because it involves only a very restricted kind of impurity.

To create the illusion of a lazy data structure that's actually created with on-demand I/O, functions like readFile are implemented using sneaky shenanigans behind the scenes. Weaving in the on-demand I/O is inherent to the function, and it's not really extensible for pretty much the same reasons that the illusion of getting a regular ByteString from it is convincing.

Handwaving the details and writing pseudocode, something like readFile basically works like this:

lazyInput inp = lazyIO (lazyInput' inp)
lazyInput' inp = do x <- readFrom inp
                    if (endOfInput inp)
                        then return []
                        else do xs <- lazyInput inp
                                return (x:xs)

...where each time lazyIO is called, it defers the I/O until the value is actually used. To invoke your reporting function each time the actual read occurs, you'd need to weave it in directly, and while a generalized version of such a function could be written, to my knowledge none exist.

Given the above, you have a few options:

Look up the implementation of the lazy I/O functions you're using, and implement your own that include the progress reporting function. If this feels like a dirty hack, that's because it pretty much is, but there you go.
Abandon lazy I/O and switch to something more explicit and composable. This is the direction that the Haskell community as a whole seems to be heading in, specifically toward some variation on Iteratees, which give you nicely composable little stream processor building blocks that have more predictable behavior. The downside is that the concept is still under active development so there's no consensus on implementation or single starting point for learning to use them.
Abandon lazy I/O and switch to plain old regular I/O: Write an IO action that reads a chunk, prints the reporting info, and processes as much input as it can; then invoke it in a loop until done. Depending on what you're doing with the input and how much you're relying on laziness in your processing, this could involve anything from writing a couple nearly-trivial functions to building a bunch of finite-state-machine stream processors and getting 90% of the way to reinventing Iteratees.

[0]: The underlying function here is called unsafeInterleaveIO, and to the best of my knowledge the only ways to observe impurity from it require either running the program on different input (in which case it's entitled to behave differently anyhow, it just may be doing so in ways that don't make sense in pure code), or changing the code in certain ways (i.e., refactorings that should have no effect can have non-local effects).

Here's a rough example of doing things the "plain old regular I/O" way, using more composable functions:

import System
import System.IO
import qualified Data.ByteString.Lazy as B

main = do [from, to] <- getArgs
          -- withFile closes the handle for us after the action completes
          withFile from ReadMode $ \inH ->
            withFile to WriteMode $ \outH ->
                -- run the loop with the appropriate actions
                runloop (B.hGet inH 128) (processBytes outH) B.null

-- note the very generic type; this is useful, because it proves that the
-- runloop function can only execute what it's given, not do anything else
-- behind our backs.
runloop :: (Monad m) => m a -> (a -> m ()) -> (a -> Bool) -> m ()
runloop inp outp done = do x <- inp
                           if done x
                             then return ()
                             else do outp x
                                     runloop inp outp done

-- write the output and report progress to stdout. note that this can be easily
-- modified, or composed with other output functions.
processBytes :: Handle -> B.ByteString -> IO ()
processBytes h bs | B.null bs = return ()
                  | otherwise = do onReadBytes (fromIntegral $ B.length bs)
                                   B.hPut h bs

onReadBytes :: Integer -> IO ()
onReadBytes count = putStrLn $ "Bytes read: " ++ (show count)

The "128" up there is how many bytes to read at a time. Running this on a random source file in my "Stack Overflow snippets" directory:

$ runhaskell ReadBStr.hs Corec.hs temp
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 128
Bytes read: 83
$

Spearmint answered 12/7, 2011 at 18:49 Comment(6)

This is an amazing answer. I'd heard lots of complaints about lazy IO, but never really understood them; this explanation clearly indicates why a thing that appears nicely compositional in fact is basically broken. – Townie 12/7, 2011 at 19:5

@Daniel Wagner: Lazy I/O as it exists currently is highly compositional in exactly one way: composing pure processor functions and "naive" output actions directly with a single input action. When the output needs to be aware of how reading is done, when inputs need to be merged, or when processor functions need to do I/O, it gets clumsy pretty quickly. – Spearmint 12/7, 2011 at 19:11

@oshyshko: Added an example. It's a bit longer than your current code (as expected) but you can hopefully see how the approach is easier to extend. – Spearmint 12/7, 2011 at 19:52

A little question: why do you check B.null twice? ("if done x" and "processBytes ") It seems that "if done x" will never allow empty ByteString into "outp x". – Exophthalmos 12/7, 2011 at 20:53

@oshyshko: Two reasons; the first is that, since I wanted to demonstrate how this makes things more modular/composable, I added that because processBytes could be reused in a context where it might receive an empty ByteString and I wanted it to take no action in that case. The more accurate reason is that I needed another cup of coffee and wasn't really thinking clearly. – Spearmint 12/7, 2011 at 21:6

Though I will add that, if a function can correctly handle all possible inputs, it's worth doing so even when you expect it to only receive a limited subset of them. Makes things less likely to break if you change the program elsewhere in a way that alters that expectation. – Spearmint 12/7, 2011 at 21:9

Use Data.ByteString.Lazy.Progress. It allows you to print all kinds of metrics as data passes through.

Phocaea answered 7/11, 2011 at 12:41 Comment(3)

Do you have any example how to use this package ? – Cabretta 8/5, 2012 at 13:12

@jinkou2jinkou2: I did manage to get it work, despite my poor haskell skills. Try with something to the effect of trackProgressString "%b" Nothing printLn flowMonad, where flowMonad is the function that does the IO. – Phocaea 9/5, 2012 at 7:11

thanks. I ask also to the maintainer, who kindly added a perfect working example to the package. – Cabretta 12/5, 2012 at 6:36

Recommended topics

Hot tags