I use haskell for line-based data processing, i.e. tasks where you can apply sed
, awk
and similar tools. As a trivial example, let's prepend 000
to every line from standard input.
I have three alternative ways to do the task:
- Lazy IO with lazy
ByteString
s - Line-based conduit.
- Chunk-based conduit with pure strict
ByteString
processing inside.
example.hs
:
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE FlexibleContexts #-}
import ClassyPrelude.Conduit
import qualified Data.ByteString.Char8 as B8
import qualified Data.ByteString.Lazy.Char8 as BL8
import qualified Data.Conduit.Binary as CB
main = do
[arg] <- getArgs
case arg of
"lazy" -> BL8.getContents >>= BL8.putStr . BL8.unlines . map ("000" ++) . BL8.lines
"lines" -> runConduitRes $ stdinC .| CB.lines .|
mapC ("000" ++) .| mapC (`snoc` 10) .| stdoutC
"chunks" -> runConduitRes $ stdinC .| lineChunksC .|
mapC (B8.unlines . (map ("000" ++)) . B8.lines) .| stdoutC
lineChunksC :: Monad m => Conduit ByteString m ByteString
lineChunksC = await >>= maybe (return ()) go
where
go acc = if
| Just (_, 10) <- unsnoc acc -> yield acc >> lineChunksC
| otherwise -> await >>= maybe (yield acc) (go' . breakAfterEOL)
where
go' (this, next) = let acc' = acc ++ this in if null next then go acc' else yield acc' >> go next
breakAfterEOL :: ByteString -> (ByteString, ByteString)
breakAfterEOL = uncurry (\x -> maybe (x, "") (first (snoc x)) . uncons) . break (== 10)
$ stack ghc --package={classy-prelude-conduit,conduit-extra} -- -O2 example.hs -o example $ for cmd in lazy lines chunks; do echo $cmd; time -p seq 10000000 | ./example $cmd > /dev/null; echo; done lazy real 2.99 user 3.06 sys 0.07 lines real 3.30 user 3.36 sys 0.06 chunks real 1.83 user 1.95 sys 0.06
(The results are consistent across multiple runs, and also hold for lines with several numbers).
So chunks
is 1.6x faster than lines
which is a bit faster than lazy
. This means that conduits can be faster than plain bytestrings, but the overhead of conduit pipes is too heavy when you split chunks into short lines.
What I don't like about chunks
approach is that it mixes both conduit and pure worlds, and it makes it harder to use it for more complex tasks.
The question is, did I miss a simple and elegant solution which would allow me to write efficient code in same fashion as with lines
approach?
EDIT1: Per @Michael's suggestion I joined two mapC
into one mapC (("000" ++). (
snoc10))
in lines
solution, to make number of pipes (.|
) same between lines
and chunks
. This made it perform a bit better (down from 3.3s to 2.8s), but still significantly slower than chunks
.
Also I tried older Conduit.Binary.lines
which Michael suggested in comments, and it also improves performance a bit, by ~0.1s.
EDIT2: Fixed lineChunksC
so it works with very small chunks, e.g.
> runConduitPure $ yield ("\nr\n\n"::ByteString) .| concatC .| mapC singleton .| lineChunksC .| sinkList
["\n","r\n","\n"]
mapC (`snoc` 10)
? – Josey'\n'
, so it appends newline character back to each line. – GravestdinC' = (stdinC :: Source (ResourceT IO) B.ByteString) .| concatC .| mapC B.singleton
This doesn't affect the semantics ofstdinC
but it breaks this function, but not the conduit lines function. – Fortdefrancechunks
function allocates as many intermediate bytestrings as there are bytes in the line. It is rational where you expect one or two chunks per line. This is fairly typical of course. – Fortdefrancelines
had this property; see this patch github.com/snoyberg/conduit/pull/209 It might be that the old version was faster for inputs like yours, where we have a ton of one-chunk short lines. – FortdefranceCB.lines
performs better for me (speeds the run by ~5%), but most of time is spent in conduit pipeline anyway (fuse,>>=
). – GravelineChunksC
? It's quite likely, I just wrote for thechunks
solution and never tested on anything else. – GravemapC (("000" ++). (`snoc` 10))
instead ofmapC ("000" ++) .| mapC (`snoc` 10)
. This is done implicitly by the other. – Fortdefrance.|
makes for a more complex program - unless they fuse away. Yourchunks
uses three, but yourlines
uses four. This has nothing to do with the lines function. – FortdefrancemapC
improves performance, but not drastically, so there is still serious performance loss with per-line conduit processing. – GravelineChunksC
it will be impossible to bench this. Try it with, e.g.yield ("\nr\n\n"::B.ByteString) =$= concatC =$= mapC B.singleton
instead ofstdinC
. This is essential to benchmarking. – FortdefranceByteString.lines
where possible - but as expected, if I feed it a file that is 10 million 't's with no linebreaks, it is much slower sprunge.us/APeK If I give the file in single byte chunks char by char, it doesn't complete after 5 minutes, since it is allocating 10M separate bytestrings, while the other is done in 5 sec. It is I think quadratic in the number of chunks per line. Here's my source sprunge.us/gVQG – Fortdefrance