I wanted to reimplement some of my ASCII parsers in Haskell since I thought I could gain some speed. However, even a simple "grep and count" is much slower than a sloppy Python implementation.
Can someone explain why, and how to do it correctly?
The task is to count the lines that start with the string "foo".
My very basic Python implementation:
with open("foo.txt", 'r') as f:
    print(len([line for line in f.readlines() if line.startswith('foo')]))
And the Haskell version:
import System.IO
import Data.List
countFoos :: String -> Int
countFoos str = length $ filter (isPrefixOf "foo") (lines str)
main = do
    contents <- readFile "foo.txt"
    putStr (show $ countFoos contents)
Running both with time on a ~600MB file with 17001895 lines reveals that the Python implementation is almost 4 times faster than the Haskell one (running on my MacBook Pro Retina 2015 with PCIe SSD):
> $ time ./FooCounter
1770./FooCounter 20.92s user 0.62s system 98% cpu 21.858 total
> $ time python foo_counter.py
1770
python foo_counter.py 5.19s user 1.01s system 97% cpu 6.332 total
Compared to unix command line tools:
> $ time grep -c foo foo.txt
1770
grep -c foo foo.txt 4.87s user 0.10s system 99% cpu 4.972 total
> $ time fgrep -c foo foo.txt
1770
fgrep -c foo foo.txt 6.21s user 0.10s system 99% cpu 6.319 total
> $ time egrep -c foo foo.txt
1770
egrep -c foo foo.txt 6.21s user 0.11s system 99% cpu 6.317 total
Any ideas?
UPDATE:
Using András Kovács' implementation (ByteString), I got it under half a second!
> $ time ./FooCounter
1770
./EvtReader 0.47s user 0.48s system 97% cpu 0.964 total
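For readers who want to try this, here is a minimal sketch of what a ByteString-based counter might look like. It is not necessarily the code from the answer mentioned above; it assumes the strict Data.ByteString.Char8 API from the bytestring package and the same "foo.txt" input file as in the question.

```haskell
import qualified Data.ByteString.Char8 as B

-- Count the lines that begin with the byte sequence "foo".
countFoos :: B.ByteString -> Int
countFoos = length . filter (B.isPrefixOf (B.pack "foo")) . B.lines

main :: IO ()
main = do
  -- Strict read: the whole file is loaded into memory as one packed buffer.
  contents <- B.readFile "foo.txt"
  print (countFoos contents)
```

The strict readFile holds the entire file in memory at once. If that is a concern, Data.ByteString.Lazy.Char8 offers the same readFile, lines, pack and isPrefixOf functions and reads the file in chunks.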
Don't use String. Use either ByteString or (more likely) Text. The String type is very flexible, but very inefficient for just about everything. – Mylo
readFile has the type readFile :: FilePath -> IO String. How should I force using ByteString or Text? – Curiel
Import Data.Text.IO. You'll find another readFile function that returns a Text instead. – Mylo
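As a rough illustration of the Data.Text.IO suggestion, the original program might be adapted along these lines. This is only a sketch: it assumes the text package and the same "foo.txt" file, and the qualified names T and TIO are arbitrary choices.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Count the lines that begin with the prefix "foo".
countFoos :: T.Text -> Int
countFoos = length . filter (T.isPrefixOf (T.pack "foo")) . T.lines

main :: IO ()
main = do
  -- This readFile returns Text instead of String.
  contents <- TIO.readFile "foo.txt"
  print (countFoos contents)
```

Note that Data.Text.IO.readFile decodes the file using the current locale's encoding, whereas a ByteString version works on the raw bytes, which may matter for pure ASCII parsing.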