Efficiently reading a large file into a Map

I'm trying to write code to perform the following simple task in Haskell: looking up the etymologies of words using this dictionary, stored as a large tsv file (http://www1.icsi.berkeley.edu/~demelo/etymwn/). I thought I'd parse (with attoparsec) the tsv file into a Map, which I could then use to look up etymologies efficiently, as required (and do some other stuff with).

This was my code:

{-# LANGUAGE OverloadedStrings #-}

import Control.Arrow
import qualified Data.Map as M
import Control.Applicative
import qualified Data.Text as DT
import qualified Data.Text.Lazy.IO as DTLIO
import qualified Data.Text.Lazy as DTL
import qualified Data.Attoparsec.Text.Lazy as ATL
import Data.Monoid

text = do
    x <- DTLIO.readFile "../../../../etymwn.tsv"
    return $ DTL.take 10000 x

--parsers
wordpair = do
    x <- ATL.takeTill (== ':')
    ATL.char ':' *> (ATL.many' $ ATL.char ' ')
    y <- ATL.takeTill (\x -> x `elem` ['\t','\n'])
    ATL.char '\n' <|> ATL.char '\t'
    return (x,y)

--line of file
line = do
    a <- (ATL.count 3 wordpair)
    case (rel (a !! 2)) of 
        True -> return . (\[a,b,c] -> [(a,c)]) $ a
        False -> return . (\[a,b,c] -> [(c,a)]) $ a
    where rel x = if x == ("rel","etymological_origin_of") then False else True

tsv = do 
    x <- ATL.many1 line
    return $ fmap M.fromList x

main = (putStrLn . show . ATL.parse tsv) =<< text

It works for small amounts of input, but quickly becomes too inefficient. I'm not quite clear on where the problem is; I soon realized that even trivial tasks, like viewing the last character of the file, were taking too long, e.g. with

foo = fmap DTL.last $ DTLIO.readFile "../../../../etymwn.tsv"

So my questions are: what are the main things that I'm doing wrong, in terms of approach and execution? Any tips for more Haskelly/better code?

Thanks,

Reuben

Usher answered 29/7, 2015 at 10:28 Comment(4)
Did you profile your code? nikita-volkov.github.io/profiling-cabal-projects downloads.haskell.org/~ghc/latest/docs/html/users_guide/… book.realworldhaskell.org/read/profiling-and-optimization.html – Sibel
If the file that you are reading is too big, a good option to reduce the startup time of the program is to move the contents of that file into a database (embedded or not). Once indexed in the database, random lookups can be performed directly without reading the file sequentially first. – Heat
In addition to profiling I suggest going through this short guide about performance considerations: hackage.haskell.org/package/attoparsec-0.13.0.1/docs/… – Shuttle
I'm just guessing here: the parser might need to scan the whole file to choose whether to return a successful value or a failure one. I.e. ATL.many1 might not be that lazy, forcing everything in memory anyway. – Ennius

Note that the file you want to load has 6 million lines and the text you are interested in storing comprises approx. 120 MB.

Lower Bounds

To establish some lower bounds, I first created another .tsv file containing the preprocessed contents of the etymwn.tsv file. I then timed how long it took for this Perl program to read that file:

my %H;
while (<>) {
  chomp;
  my ($a,$b) = split("\t", $_, 2);
  $H{$a} = $b;
}

This took approx. 17 secs., so I would expect any Haskell program to take about that amount of time.

If this start-up time is unacceptable, consider the following options:

  1. Work in ghci and use the "live reloading" technique to save the map using the Foreign.Store package so that it persists through ghci code reloads. That way you only have to load the map data once as you iterate your code.
  2. Use a persistent key-value store (such as sqlite, gdbm, BerkeleyDB)
  3. Access the data through a client-server store
  4. Reduce the number of key-value pairs you store (do you need all 6 million?)

Option 1 is discussed in a blog post by Chris Done.
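
For concreteness, here is a minimal sketch of that approach; loadMap stands in for your own loading code (an assumption, not from the post), and index 0 assumes this is the first store created in the session:

import Foreign.Store (lookupStore, newStore, readStore)
import qualified Data.Map as M
import qualified Data.Text as DT

-- Assumed helper: however you build the map (e.g. with the parsers below).
loadMap :: IO (M.Map DT.Text DT.Text)
loadMap = undefined

getMap :: IO (M.Map DT.Text DT.Text)
getMap = do
  existing <- lookupStore 0         -- assumes this is the session's first store
  case existing of
    Just store -> readStore store   -- reuse the map across :reload
    Nothing    -> do
      m <- loadMap                  -- pay the expensive load only once
      _ <- newStore m               -- pin the map so reloads don't drop it
      return m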

Options 2 and 3 will require you to work in the IO monad.
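
As a sketch of option 2 using the sqlite-simple package (the etym(key, value) table, database file name, and lookup key are all assumptions; you would populate the table once, up front):

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import Database.SQLite.Simple (Connection, Only (..), open, query)

-- Indexed lookups against the database replace the in-memory Map.
lookupEtym :: Connection -> T.Text -> IO [Only T.Text]
lookupEtym conn w =
  query conn "SELECT value FROM etym WHERE key = ?" (Only w)

main :: IO ()
main = do
  conn <- open "etymwn.sqlite"        -- assumed database file
  print =<< lookupEtym conn "foobar"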

Parsing

First of all, check the type of your tsv function:

tsv :: Data.Attoparsec.Internal.Types.Parser
          DT.Text [M.Map (DT.Text, DT.Text) (DT.Text, DT.Text)]

You are returning a list of maps instead of just one map. This doesn't look right.

Secondly, as @chi suggested, I doubt that this use of attoparsec is lazy. In particular, it has to verify that the entire parse succeeds, so I can't see how it can avoid creating all of the parsed lines before returning.

To truly parse the input lazily, take the following approach:

toPair :: DTL.Text -> (Key, Value)
toPair input = ...

main = do
  all_lines <- fmap DTL.lines $ DTLIO.getContents
  let m = M.fromList $ map toPair all_lines
  print $ M.lookup "foobar" m

You can still use attoparsec to implement toPair, but you'll be using it on a line-by-line basis instead of on the entire input.

ByteString vs. Text

In my experience, working with ByteStrings is much faster than working with Text.

This version of toPair for ByteStrings is about 4 times faster than the corresponding version for Text:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.Attoparsec.ByteString.Lazy as AL

toPair :: L.ByteString -> (L.ByteString, L.ByteString)
toPair bs =
  case AL.maybeResult (AL.parse parseLine bs) of
    Nothing    -> error "bad line"
    -- attoparsec's takeWhile yields strict chunks; convert back to lazy
    Just (a,b) -> (L.fromStrict a, L.fromStrict b)
  where parseLine = do
          A.skipWhile (/= ' ')
          A.skipWhile (== ' ')
          a <- A.takeWhile (/= '\t')
          A.skipWhile (== '\t')
          rel <- A.takeWhile (/= '\t')
          A.skipWhile (== '\t')
          A.skipWhile (/= ' ')
          A.skipWhile (== ' ')
          c <- A.takeWhile (const True)
          if rel == "rel:etymological_origin_of"
            then return (c,a)
            else return (a,c)
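
For example, here is a sketch of wiring this toPair into the line-by-line loading scheme from above (using the imports above plus Data.Map as M; the path and lookup key are just for illustration):

main :: IO ()
main = do
  all_lines <- fmap L.lines (L.readFile "etymwn.tsv")
  let m = M.fromList (map toPair all_lines)
  print (M.lookup "foobar" m)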

Or, just use plain ByteString functions:

fields :: L.ByteString -> [L.ByteString]
fields = L.splitWith (== '\t')

snipSpace :: L.ByteString -> L.ByteString
snipSpace = L.dropWhile (== ' ') . L.dropWhile (/= ' ')

toPair'' :: L.ByteString -> (L.ByteString, L.ByteString)
toPair'' bs =
  case fields bs of
    (x:y:z:_) -> let a = snipSpace x
                     c = snipSpace z
                 in
                 if y == "rel:etymological_origin_of"
                   then (c,a)
                   else (a,c)
    _         -> error "bad line"

Most of the time spent loading the map is in parsing the lines. For ByteStrings this is about 14 sec. to load all 6 million lines vs. 50 secs. for Text.
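
If you want to reproduce numbers like these, a rough way to time the load (getCurrentTime is from the time package; the path is again an assumption) is to force the map before reading the clock a second time:

import Data.Time.Clock (diffUTCTime, getCurrentTime)

main :: IO ()
main = do
  t0 <- getCurrentTime
  all_lines <- fmap L.lines (L.readFile "etymwn.tsv")
  let m = M.fromList (map toPair'' all_lines)
  print (M.size m)          -- forces the map's spine to be built
  t1 <- getCurrentTime
  print (diffUTCTime t1 t0)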

Yellowstone answered 29/7, 2015 at 20:48 Comment(0)

To add to this answer, I'd like to note that attoparsec actually has very good support for "pull-based" incremental parsing. You can use this directly with the convenient parseWith function. For even finer control, you can feed the parser by hand with parse and feed. If you don't want to worry about any of this, you should be able to use something like pipes-attoparsec, but personally I find pipes a bit hard to understand.
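
For instance, here is a minimal sketch of the parseWith style; the chunk size, file name, and toy parser are my own choices for illustration:

import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as B
import System.IO (Handle, IOMode (ReadMode), withFile)

-- parseWith pulls a fresh chunk whenever the parser exhausts its input;
-- an empty chunk (at end of file) tells it the stream is done.
parseFromHandle :: A.Parser a -> Handle -> IO (A.Result a)
parseFromHandle p h = A.parseWith (B.hGetSome h 65536) p B.empty

main :: IO ()
main = withFile "etymwn.tsv" ReadMode $ \h -> do
  r <- parseFromHandle (A.takeWhile (/= 9)) h  -- toy parser: bytes before the first tab
  print r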

Pasteup answered 29/7, 2015 at 22:27 Comment(0)
