Processing a very large text file with lazy Texts and ByteStrings

I'm trying to process a very large Unicode text file (6GB+). What I want is to count the frequency of each unique word. I use a strict Data.Map to keep track of the counts of each word as I traverse the file. The process takes too much time and too much memory (20GB+). I suspect the Map is huge, but I'm not sure it should reach 5x the size of the file! The code is shown below. Please note that I tried the following:

  • Using Data.HashMap.Strict instead of Data.Map.Strict. Data.Map seems to perform better here: its memory consumption grows more slowly.

  • Reading the file using lazy ByteString instead of lazy Text, then decoding it to Text, doing the processing, and encoding the result back to ByteString for IO.

    import Data.Text.Lazy (Text(..), cons, pack, append)
    import qualified Data.Text.Lazy as T
    import qualified Data.Text.Lazy.IO as TI
    import Data.Map.Strict hiding (foldr, map, foldl')
    import System.Environment
    import System.IO
    import Data.Word
    
    -- Build the frequency map: pair every word with a count of 1 and combine
    -- duplicate keys with (+). Note the counts are Word16, so they wrap past 65535.
    dictionate :: [Text] -> Map Text Word16
    dictionate = fromListWith (+) . (`zip` [1,1..])
    
    main = do
        [file,out] <- getArgs
        h <- openFile file ReadMode
        hO <- openFile out WriteMode
        mapM_ (flip hSetEncoding utf8) [h,hO]
        txt <- TI.hGetContents h
        TI.hPutStr hO . T.unlines . 
          map (uncurry ((. cons '\t' . pack . show) . append)) . 
          toList . dictionate . T.words $ txt
        hFlush hO
        mapM_ hClose [h,hO]
        print "success"
    

What's wrong with my approach? What's the best way to accomplish what I'm trying to do in terms of time and memory performance?

Biforate asked 4/11, 2013 at 23:54 Comment(23)
How many distinct words are there, roughly, in the file? That should give quite a hint as to whether such high memory consumption is inevitable.Hypoplasia
Are you reading the whole file into memory in order to process it? If so, that explains the high memory usage. Try reading in the file line by line.Miso
@acfrancis: Data.Text.Lazy.IO.hGetContents should certainly get that point right.Hypoplasia
@Hypoplasia let's assume the worst-case, all words in the file are unique. Should the Map size reach 30GB?Biforate
That's surely possible. A tree-like structure such as Map has lots of overhead, I'd suspect (assuming x86-64) ~4 machine words per cell, an extra two for each Text, and finally 1-2 to store the words' characters themselves. Add some total overhead for garbage collection, and 30GB seems not unlikely for hundreds of millions of words.Hypoplasia
@Hypoplasia is there a way to keep parts of this data-structure on disk while still maintaining fast access for the parts currently in-use?Biforate
I suppose it is, but probably quite difficult for a Map. If you don't expect many duplicates among the words, perhaps you should switch to something else entirely. External merge sort (with duplicate-counting & nubbing as a separate step) is relatively simple. Of course anything external is bound to involve lots of dirty IO; I wager it's actually easier to code this up in C or C++, which also give you much better control over the data structures' overhead.Hypoplasia
Anecdotal supporting evidence: I just ran your program on a 516MB text file, generated by concatenating 512 copies of the text of a book. Memory consumption remained constant at ~16MB; the final output listed 25180 words.Algolagnia
@Algolagnia That's really weird. What version of GHC are you using and on which architecture? Also are you processing unicode text? I am processing a mix of arabic and english text.Biforate
@Hypoplasia is there anything of this sort already written for Haskell somewhere?Biforate
It's not weird at all, @Algolagnia just used a text file with far fewer unique words (inevitable for a natural language, and not helped by the use of many copies of the same text) than you apparently have in your file.Hypoplasia
7.6.3, x86-64. Without knowing your input, I guess my results support what leftaroundabout said (my file was made from natural language text, so there are relatively few distinct words).Algolagnia
I wonder why Data.HashMap is so much slower than Data.Map, even though the unordered-containers page says: "The containers have been optimized for performance critical use, both in terms of large data quantities and high speed".Biforate
Why not try replacing Map then? It ought to be trivial to replace Map with HashMap and easy to replace it with Hashtable.Eaglet
@J.Abrahamson I did. That's how I got to know it's much slower :)Biforate
@Biforate Ah, you commented as I wrote mine I think.Eaglet
Just to ask the stupid questions: you're compiling with -O2, right?Melissa
@DanielWagner: yes, tried both -O and -O2 :)Biforate
You might profit from using something like bytestring-trieEaglet
Try finding efficient word count in some other languages and compare. That may give you insight into the problem (i.e. Dictionary has high overhead? Making copies of strings?)Pistil
seconding @J.Abrahamson's suggestion; a trie is at least in theory a much better data structure for this.Saintsimonianism
...oh, and of course the obligatory "what did you learn from profiling?" commentSaintsimonianism
Have you considered using a Trie for this? It's a very good tool for this kind of analysis: en.wikipedia.org/wiki/Trie hackage.haskell.org/package/TrieMap Tries have particular advantages over binary trees or hashmaps for your kind of problem.Propman

This memory usage is expected. A Data.Map.Map with N entries consumes about 6N words of memory plus the size of the keys and values (data taken from this excellent post by Johan Tibell). A lazy Text value takes up 7 words plus 2 bytes per character (rounded up to a multiple of the machine word size), and a Word16 takes up two words (header + payload). We will assume a 64-bit machine, so the word size is 8 bytes. We will also assume that the average string in the input is 8 characters long.

Taking this all into account, the final formula for the memory usage is 6N + 7N + 2N + 2N = 17N words: the Map overhead, the Text headers, the Text payloads (8 characters = 2 words each), and the Word16 values, respectively.

In the worst case, all words will be different and there will be about (6 * 1024^3)/8 ~= 800 * 10^6 of them (one word per 8 bytes of input). Plugging that into the formula above gives a worst-case map size of approx. 102 GiB, which seems to agree with the experimental results. Solving the equation in the reverse direction tells us that your file contains about 200*10^6 different words.
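
For concreteness, here is the same arithmetic as a quick GHCi sanity check, using the assumptions above (8-byte machine words, 8-character average words; bytesPerEntry is just a name for the 17-word-per-entry figure):

    ghci> let bytesPerEntry = (6 + 7 + 2 + 2) * 8   -- 17 machine words of 8 bytes each
    ghci> ((6 * 1024^3) `div` 8) * bytesPerEntry `div` 1024^3   -- worst case, in GiB
    102
    ghci> 200 * 10^6 * bytesPerEntry `div` 1024^3   -- ~200 million distinct words, in GiB
    25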

As for alternative approaches to this problem, consider using a trie (as suggested by J.Abrahamson in the comments) or an approximate method, such as a count-min sketch; a rough sketch of the latter is shown below.
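
Just to make the second suggestion concrete, here is a minimal count-min sketch counter. This is only a sketch of the idea, not code from the answer: the table dimensions, the use of hashWithSalt (from the hashable package) as the per-row hash, and the command-line interface are all illustrative choices. Memory stays fixed at width * depth counters no matter how many distinct words the file contains; the price is that the reported counts may overestimate (but never underestimate) the true frequencies.

    import qualified Data.Text                   as T
    import qualified Data.Text.Lazy              as TL
    import qualified Data.Text.Lazy.IO           as TLIO
    import qualified Data.Vector.Unboxed.Mutable as UM
    import           Data.Hashable (hashWithSalt)
    import           Control.Monad (forM_)
    import           Data.Word (Word64)
    import           System.Environment (getArgs)
    
    width, depth :: Int
    width = 1024 * 1024          -- counters per row (illustrative value)
    depth = 4                    -- number of independent hash rows
    
    -- One bucket index per row; `mod` keeps the index non-negative.
    buckets :: T.Text -> [(Int, Int)]
    buckets w = [ (row, hashWithSalt row w `mod` width) | row <- [0 .. depth - 1] ]
    
    main :: IO ()
    main = do
        (file : queries) <- getArgs
        table <- UM.replicate (width * depth) (0 :: Word64)
        -- Stream the file word by word; only the fixed-size table is retained.
        ws <- TL.words <$> TLIO.readFile file
        forM_ ws $ \w ->
            forM_ (buckets (TL.toStrict w)) $ \(row, col) ->
                UM.modify table (+ 1) (row * width + col)
        -- The estimated count of a word is the minimum over its buckets.
        forM_ queries $ \q -> do
            counts <- mapM (\(row, col) -> UM.read table (row * width + col))
                           (buckets (T.pack q))
            putStrLn (q ++ "\t" ++ show (minimum counts))

A trie-based version would instead keep exact counts with the same streaming shape: a strict left fold over the words, inserting with (+) into the trie in place of the Map.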

Showery answered 5/11, 2013 at 8:34 Comment(0)

In the world of traditional data processing, this problem would have been solved by sorting (externally, on disk or magnetic tape if needed) and then scanning the sorted file to count the grouped-together runs of words. Of course, you could do some partial reductions during the early phases of sorting to save some space and time.
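
For what it's worth, here is a minimal sketch of the second phase in Haskell: counting the grouped-together runs in input that has already been sorted, one word per line. The external sort producing that input is assumed to have been done separately (for example with the Unix sort utility or a hand-rolled external merge sort).

    import qualified Data.Text.Lazy    as TL
    import qualified Data.Text.Lazy.IO as TLIO
    import           Data.List (group)
    
    main :: IO ()
    main = do
        sortedWords <- TL.lines <$> TLIO.getContents   -- sorted input, one word per line
        -- Equal words are adjacent after sorting, so the length of each run is
        -- that word's total count; only one run needs to be in memory at a time.
        mapM_ (\run -> TLIO.putStrLn (head run `TL.append` TL.pack ('\t' : show (length run))))
              (group sortedWords)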

Nimitz answered 13/11, 2013 at 6:7 Comment(0)
