I'm trying to process large files in Haskell. I'd like to read an input file byte by byte and generate the output byte by byte. Of course, the I/O needs to be buffered in blocks of a reasonable size (a few KB). I can't get this to work and would appreciate your help.
import System
import qualified Data.ByteString.Lazy as BL
import Data.Word
import Data.List

main :: IO ()
main =
    do
        args <- System.getArgs
        let filename = head args
        byteString <- BL.readFile filename
        let wordsList = BL.unpack byteString
        let foldFun acc word = doSomeStuff word : acc
        let wordsListCopy = foldl' foldFun [] wordsList
        let byteStringCopy = BL.pack (reverse wordsListCopy)
        BL.writeFile (filename ++ ".cpy") byteStringCopy
    where
        doSomeStuff = id
I name this file TestCopy.hs, then do the following:
$ ls -l *MB
-rwxrwxrwx 1 root root 10000000 2011-03-24 13:11 10MB
-rwxrwxrwx 1 root root 5000000 2011-03-24 13:31 5MB
$ ghc --make -O TestCopy.hs
[1 of 1] Compiling Main ( TestCopy.hs, TestCopy.o )
Linking TestCopy ...
$ time ./TestCopy 5MB
real 0m5.631s
user 0m1.972s
sys 0m2.488s
$ diff 5MB 5MB.cpy
$ time ./TestCopy 10MB
real 3m6.671s
user 0m3.404s
sys 1m21.649s
$ diff 10MB 10MB.cpy
$ time ./TestCopy 10MB +RTS -K500M -RTS
real 2m50.261s
user 0m3.808s
sys 1m13.849s
$ diff 10MB 10MB.cpy
$
My problem: there is a huge difference between the 5 MB file and the 10 MB file. I'd like the running time to be linear in the size of the input file. What am I doing wrong, and how can I achieve this? I don't mind using lazy bytestrings or anything else as long as it works, but it has to be a standard GHC library.
Clarification: this is for a university project, and I'm not trying to copy files. The doSomeStuff function will perform compression/decompression actions that I have to customize.
pack and unpack are very costly operations. Can't you doSomeStuff directly with ByteString? NB: lazy ByteString is 'buffered' internally which might be adequate for your task – Leap
BL.cons acc let byteStringCopy = BL.foldl' foldFun BL.empty byteString BL.writeFile (filename ++ ".cpy") (byteStringCopy) where doSomeStuff = id – Wesla
ByteString.cons is very costly too :) (it does memcpy internally and here it's executed repeatedly for every byte in your file). So you should avoid cons as well. Here you just copy the entire file - try writing ByteString that will show you the actual (maximum) speed of lazy ByteString reading/writing. What actual transformation do you intend to do with the content of the file? – Leap
ByteString.mapAccumL to code your bitwise operations btw. – Leap
cabal unpack blaze-builder -- now you have the source and can move any files you need directly into your repo. – Fablan
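A minimal sketch of what the comments above suggest: work on the lazy ByteString directly instead of going through unpack, a list fold, and pack. It assumes a recent GHC (so getArgs comes from System.Environment rather than the old System module). The names transformStateless, deltaEncode and deltaDecode are made up for illustration, and the delta coding is only a stand-in for a real byte-wise transformation, not the asker's actual algorithm.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)
import System.Environment (getArgs)

-- Placeholder per-byte transformation, as in the question.
doSomeStuff :: Word8 -> Word8
doSomeStuff = id

-- Stateless case: BL.map applies the function to each byte, chunk by chunk,
-- without building an intermediate [Word8] list.
transformStateless :: BL.ByteString -> BL.ByteString
transformStateless = BL.map doSomeStuff

-- Stateful case (hypothetical example): delta coding with BL.mapAccumL.
-- The accumulator is the previous input byte; each output byte is the
-- difference to it. Word8 arithmetic wraps around modulo 256.
deltaEncode :: BL.ByteString -> BL.ByteString
deltaEncode = snd . BL.mapAccumL step 0
  where
    step prev w = (w, w - prev)

-- Inverse of deltaEncode: the accumulator is the previously decoded byte.
deltaDecode :: BL.ByteString -> BL.ByteString
deltaDecode = snd . BL.mapAccumL step 0
  where
    step prev d = let w = prev + d in (w, w)

main :: IO ()
main = do
  args <- getArgs
  case args of
    [filename] -> do
      input <- BL.readFile filename   -- read lazily, in chunks
      BL.writeFile (filename ++ ".cpy") (transformStateless input)
    _ -> putStrLn "usage: TestCopy FILE"
```

Both BL.map and BL.mapAccumL emit exactly one output byte per input byte, so this only covers length-preserving transformations. Since the input is consumed chunk by chunk, the running time should grow roughly linearly with file size, but that is worth confirming with the same time measurements as in the question.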