Ways to improve performance for line-based conduits
I use Haskell for line-based data processing, i.e. tasks where you could apply sed, awk, and similar tools. As a trivial example, let's prepend 000 to every line read from standard input.
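For reference, here is the same task as a sed one-liner, just to pin down the expected behavior:

$ seq 3 | sed 's/^/000/'
0001
0002
0003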

I have three alternative ways to do the task:

  1. Lazy IO with lazy ByteStrings.
  2. Line-based conduit.
  3. Chunk-based conduit with pure strict ByteString processing inside.

example.hs:

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE MultiWayIf #-} -- required for the "if |" syntax in lineChunksC

import ClassyPrelude.Conduit
import qualified Data.ByteString.Char8 as B8
import qualified Data.ByteString.Lazy.Char8 as BL8
import qualified Data.Conduit.Binary as CB

main = do
  [arg] <- getArgs
  case arg of

    "lazy" -> BL8.getContents >>= BL8.putStr . BL8.unlines . map ("000" ++) . BL8.lines

    "lines" -> runConduitRes $ stdinC .| CB.lines .|
      mapC ("000" ++) .| mapC (`snoc` 10) .| stdoutC

    "chunks" -> runConduitRes $ stdinC .| lineChunksC .|
      mapC (B8.unlines . (map ("000" ++)) . B8.lines) .| stdoutC


-- Regroup incoming chunks so that every emitted chunk ends at a newline;
-- pure line-based code can then safely process whole chunks at a time.
lineChunksC :: Monad m => Conduit ByteString m ByteString
lineChunksC = await >>= maybe (return ()) go
  where
  go acc = if
    -- acc already ends with '\n' (byte 10): emit it and start over
    | Just (_, 10) <- unsnoc acc -> yield acc >> lineChunksC
    -- otherwise pull more input, carrying the partial last line forward
    | otherwise -> await >>= maybe (yield acc) (go' . breakAfterEOL)
    where
    go' (this, next) =
      let acc' = acc ++ this
      in if null next then go acc' else yield acc' >> go next

-- Split a chunk just after its first newline; the second component is
-- empty when the chunk contains no newline at all.
breakAfterEOL :: ByteString -> (ByteString, ByteString)
breakAfterEOL = uncurry (\x -> maybe (x, "") (first (snoc x)) . uncons) . break (== 10)
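
For intuition, here is how breakAfterEOL behaves in GHCi (with OverloadedStrings enabled, as in the EDIT 2 snippet below):

> breakAfterEOL "ab\ncd"
("ab\n","cd")
> breakAfterEOL "abcd"
("abcd","")
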
$ stack ghc --package={classy-prelude-conduit,conduit-extra} -- -O2 example.hs -o example
$ for cmd in lazy lines chunks; do echo $cmd; time -p seq 10000000 | ./example $cmd > /dev/null; echo; done
lazy
real 2.99
user 3.06
sys 0.07

lines
real 3.30
user 3.36
sys 0.06

chunks
real 1.83
user 1.95
sys 0.06

(The results are consistent across multiple runs, and they also hold for longer lines containing several numbers.)

So chunks is roughly 1.8x faster than lines, which in turn is slightly slower than lazy. This means that conduits can be faster than plain lazy bytestrings, but the overhead of conduit pipes becomes too heavy when you split chunks into short lines.

What I don't like about the chunks approach is that it mixes the conduit and pure worlds, which makes it harder to use for more complex tasks.

The question is: did I miss a simple and elegant solution that would allow me to write efficient code in the same fashion as the lines approach?

EDIT 1: Per @Michael's suggestion, I joined the two mapC stages into one mapC (("000" ++) . (`snoc` 10)) in the lines solution, to make the number of pipes (.|) the same between lines and chunks. This made it perform a bit better (down from 3.3s to 2.8s), but it is still significantly slower than chunks.

I also tried the older Data.Conduit.Binary.lines implementation that Michael suggested in the comments, and it improves performance a bit as well, by ~0.1s.

EDIT 2: Fixed lineChunksC so that it works with very small chunks, e.g.:

> runConduitPure $ yield ("\nr\n\n"::ByteString) .| concatC .| mapC singleton .| lineChunksC .| sinkList 
["\n","r\n","\n"]
Grave asked 29/10, 2016 at 13:46. Comments (15):
Josey: What's the purpose of mapC (`snoc` 10)?
Grave: @Josey 10 is the Word8 value for '\n', so it appends the newline character back to each line.
Fortdefrance: This has a bug, for starters. Suppose I break the source up into one-byte bytestrings: stdinC' = (stdinC :: Source (ResourceT IO) B.ByteString) .| concatC .| mapC B.singleton. This doesn't affect the semantics of stdinC, but it breaks this function, whereas the conduit lines function is unaffected.
Fortdefrance: When it is corrected, it will emerge that in gluing the one-byte bytestrings together into lines, your chunks function allocates as many intermediate bytestrings as there are bytes in the line. It is rational where you expect one or two chunks per line, which is fairly typical of course.
Fortdefrance: It occurs to me the old conduit lines had this property; see this patch: github.com/snoyberg/conduit/pull/209. It might be that the old version was faster for inputs like yours, where we have a ton of one-chunk short lines.
Grave: @Fortdefrance it seems like the old CB.lines performs better for me (it speeds up the run by ~5%), but most of the time is spent in the conduit pipeline anyway (fuse, >>=).
Grave: @Fortdefrance I guess you mean the bug is in my lineChunksC? It's quite likely; I wrote it just for the chunks solution and never tested it on anything else.
Fortdefrance: Yes, if it is given the bytes in tiny chunks it exits after two lines.
Fortdefrance: The regular conduit function speeds up if you assimilate the maps, writing mapC (("000" ++) . (`snoc` 10)) instead of mapC ("000" ++) .| mapC (`snoc` 10). This is done implicitly by the other version.
Fortdefrance: There is a crude rule, of course, that more uses of .| make for a more complex program, unless they fuse away. Your chunks uses three, but your lines uses four. This has nothing to do with the lines function.
Grave: @Fortdefrance see my edit. Joining the two mapC stages improves performance, but not drastically, so there is still a serious performance loss with per-line conduit processing.
Fortdefrance: Until you correct the implementation of lineChunksC it will be impossible to bench this. Try it with, e.g., yield ("\nr\n\n"::B.ByteString) =$= concatC =$= mapC B.singleton instead of stdinC. This is essential to benchmarking.
Grave: @Fortdefrance I updated the question with the fixed version. It didn't change the timing at all (no surprise, as with normal stdin it's called only once every N kilobytes).
Fortdefrance: Right, it is fast for some nice cases, since it uses ByteString.lines where possible. But as expected, if I feed it a file that is 10 million 't's with no line breaks, it is much slower: sprunge.us/APeK. If I give it the file in single-byte chunks, char by char, it doesn't complete after 5 minutes, since it is allocating 10M separate bytestrings, while the other is done in 5 sec. It is, I think, quadratic in the number of chunks per line. Here's my source: sprunge.us/gVQG
Grave: @Fortdefrance thanks, it's a good example. Building a strict bytestring from many pieces is going to be slow.

My guess is that, for "lines", the mapC ("000" ++) .| mapC (`snoc` 10) part is doing a lot of work.

Concatenating several strict ByteStrings into another strict ByteString is expensive. Concatenating them into a lazy ByteString tends to be more efficient.
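
A minimal standalone sketch of why (not from the code above): folding strict appends over n chunks copies O(n^2) bytes in total, while a lazy ByteString can simply link the existing chunks together:

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- Strict: every B.append allocates a fresh buffer and copies both sides,
-- so a left fold over n chunks copies O(n^2) bytes in total.
strictConcat :: [B.ByteString] -> B.ByteString
strictConcat = foldl B.append B.empty

-- Lazy: fromChunks merely links the existing chunks, copying nothing.
lazyConcat :: [B.ByteString] -> BL.ByteString
lazyConcat = BL.fromChunks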

To avoid this cost, you can yield each part individually downstream as a strict ByteString (but be aware that then we aren't talking about "lines" anymore).

Alternatively, yield each transformed line as a lazy ByteString downstream.
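
A sketch of the first suggestion in the question's own vocabulary (prependPartsC is a made-up name; this assumes the question's imports, its Conduit alias, and OverloadedStrings):

-- Instead of building "000" ++ line ++ "\n" as one strict ByteString,
-- emit the three pieces as separate chunks; stdoutC writes each as-is.
prependPartsC :: Monad m => Conduit ByteString m ByteString
prependPartsC = awaitForever $ \line -> do
  yield "000"
  yield line
  yield "\n"

It would slot in after CB.lines, e.g. stdinC .| CB.lines .| prependPartsC .| stdoutC.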


The question is, did I miss a simple and elegant solution which would allow me to write efficient code in same fashion as with lines approach?

Some streaming libraries have an interesting feature: you can delimit lines in the stream, and manipulate them, without the need to materialize entire lines in memory at any point.

Here I'm using the streaming and streaming-bytestring packages, because I'm more familiar with them.

In module Data.ByteString.Streaming.Char8 of streaming-bytestring, we have the lines function:

lines :: Monad m => ByteString m r -> Stream (ByteString m) m r

lines turns a ByteString into a connected stream of ByteStrings that divide at newline characters. The resulting strings do not contain newlines. This is the genuinely streaming lines, which only breaks chunks and thus never increases the use of memory.

The gist of it is that ByteString m r is already a streaming type! So this version of lines transforms a stream into a "stream of streams". And we can only reach the "next stream" (next line) by exhausting the "current stream" (current line).

Your "lines" example could be written as:

{-# language OverloadedStrings #-}
module Main where

import Control.Applicative ((*>))
import Streaming
import qualified Streaming.Prelude as S
import qualified Data.ByteString.Streaming.Char8 as Q

main :: IO ()
main = Q.stdout                          -- write the resulting byte stream to stdout
     . Q.unlines                         -- put the newlines back between lines
     . S.maps (\line -> "000" *> line)   -- prefix each line-stream with "000"
     . Q.lines                           -- split stdin into a stream of line-streams
     $ Q.stdin
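
To drop this into the same benchmark as the question, it should build the same way as the original example (the exact package list here is my assumption):

$ stack ghc --package={streaming,streaming-bytestring} -- -O2 streaming-example.hs -o streaming-example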
Nez answered 29/10, 2016 at 16:05. Comment (1):
Grave: Thanks for the answer, I should definitely check out streaming. But after adding your code to the benchmark, the results are on par with my "lines" solution (+/- 5%), so there is no immediate improvement in my case.
