Parsing ASCII file efficiently in Haskell
Asked Answered
C

2

9

I wanted to reimplement some of my ASCII parsers in Haskell since I thought I could gain some speed. However, even a simple "grep and count" is much slower than a sloppy Python implementation.

Can someone explain me why and how to do it correctly?

So the task is, count the lines which starts with the string "foo".

My very basic Python implementation:

with open("foo.txt", 'r') as f:
    print len([line for line in f.readlines() if line.startswith('foo')])

And the Haskell version:

import System.IO 
import Data.List

countFoos :: String -> Int
countFoos str = length $ filter (isPrefixOf "foo") (lines str)

main = do
    contents <- readFile "foo.txt"
    putStr (show $ countFoos contents)

Running both with time on a ~600MB file with 17001895 lines reveals that the Python implementation is almost 4 times faster than the Haskell one (running on my MacBook Pro Retina 2015 with PCIe SSD):

> $ time ./FooCounter                                                           
1770./FooCounter  20.92s user 0.62s system 98% cpu 21.858 total

> $ time python foo_counter.py                                                   
1770
python foo_counter.py  5.19s user 1.01s system 97% cpu 6.332 total

Compared to unix command line tools:

> $ time grep -c foo foo.txt                                                     
1770
grep -c foo foo.txt   4.87s user 0.10s system 99% cpu 4.972 total

> $ time fgrep -c foo foo.txt                                                     
1770
fgrep -c foo foo.txt  6.21s user 0.10s system 99% cpu 6.319 total

> $ time egrep -c foo foo.txt                                                     
1770
egrep -c foo foo.txt  6.21s user 0.11s system 99% cpu 6.317 total

Any ideas?

UPDATE:

Using András Kovács' implementation (ByteString), I got it under half a second!

> $ time ./FooCounter                                                                                                               
1770
./EvtReader  0.47s user 0.48s system 97% cpu 0.964 total
Curiel answered 16/7, 2015 at 7:52 Comment(5)
Please compile with -O2 if you haven't.Argive
Don't use String. Use either ByteString or (more likely) Text. The String type is very flexible, but very inefficient for just about everything.Mylo
@AndrásKovács I forgot to mention, I did compile it with -O2. Actually it doesn't make a difference :-\ Anyway: kösziCuriel
@Mylo but readFile has readFile :: FilePath -> IO String. How should I force using ByteString or Text?Curiel
@septi Have a look in Data.Text.IO. You'll find another readFile function that returns a Text instead.Mylo
A
11

I benchmarked the following solution:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as B

main =
  print . length . filter (B.isPrefixOf "foo") . B.lines =<< B.readFile "test.txt"

text.txt is a 170 MB file with 8 million lines, and half of the lines start with "foo". I compiled with GHC 7.10 and -O2 -fllvm.

The ByteString version ran in 0.27 seconds, while the original version ran in 5.16 seconds.

However, the strict ByteString version uses 170 MB memory loading the full file. Changing the import to Data.ByteString.Lazy.Char8 I got 0.39 sec runtime and 1 MB memory usage.

Allistir answered 16/7, 2015 at 8:16 Comment(3)
Wow, that's impressive. Got under half a second with that!Curiel
Does this read the whole file into memory? Since I also need to parse files larger than 10GB…Curiel
Yes, it loads everything. See the edit of the answer.Argive
C
5

Your Haskell version uses the type String to represent the text of the file. String is an alias for [Char] which is a linked list of characters. That's not a good representation for large strings.

Try using the text package instead. It represents strings as arrays (in the Data.Text.* modules) or as linked lists of arrays (in the Data.Text.Lazy.* modules). To port your existing code, you probably want the latter, because I guess you don't want to load the full 600MB file into memory at once. Look in the Data.Text.Lazy and Data.Text.Lazy.IO modules for variants of the readFile, filter, isPrefixOf etc. functions you're using.

If you're sure you only want to support ASCII, you can also look into using the bytestring package instead of the text package.

Cholera answered 16/7, 2015 at 8:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.