I'm trying to process a very large Unicode text file (6GB+). I want to count the frequency of each unique word, using a strict `Data.Map` to keep track of the counts as I traverse the file.
The process takes too much time and too much memory (20GB+). I suspect the Map is huge, but I'm not sure it should reach 5x the size of the file!
The code is shown below. Please note that I already tried the following:

- Using `Data.HashMap.Strict` instead of `Data.Map.Strict`. `Data.Map` seems to perform better, in that its memory consumption grows more slowly.
- Reading the file as a lazy `ByteString` instead of lazy `Text`, then decoding it to `Text`, doing the processing, and encoding back to `ByteString` for `IO`.

```haskell
import Data.Text.Lazy (Text(..), cons, pack, append)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TI
import Data.Map.Strict hiding (foldr, map, foldl')
import System.Environment
import System.IO
import Data.Word

dictionate :: [Text] -> Map Text Word16
dictionate = fromListWith (+) . (`zip` [1,1..])

main = do
  [file, out] <- getArgs
  h  <- openFile file ReadMode
  hO <- openFile out WriteMode
  mapM_ (flip hSetEncoding utf8) [h, hO]
  txt <- TI.hGetContents h
  TI.hPutStr hO . T.unlines
    . map (uncurry ((. cons '\t' . pack . show) . append))
    . toList . dictionate . T.words $ txt
  hFlush hO
  mapM_ hClose [h, hO]
  print "success"
```
What's wrong with my approach? What's the best way to accomplish what I'm trying to do in terms of time and memory performance?
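For reference, here is a stripped-down streaming sketch of the same counting step, assuming strict `Text` keys (so each word is copied out of the large lazy chunk it points into, letting consumed chunks be garbage-collected) and `Int` counts (the `Word16` in my code would silently wrap past 65535). This is a sketch of the idea, not a tested drop-in replacement:

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLI

-- Fold strictly over the lazy stream of words. TL.toStrict copies each
-- word out of the input chunk it points into, so fully consumed chunks
-- are no longer retained and can be garbage-collected.
countWords :: TL.Text -> M.Map T.Text Int
countWords = foldl' bump M.empty
           . TL.words
  where
    bump m w = M.insertWith (+) (TL.toStrict w) 1 m

main :: IO ()
main = do
  txt <- TLI.getContents
  mapM_ (\(w, n) -> putStrLn (T.unpack w ++ "\t" ++ show n))
        (M.toList (countWords txt))
```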
Comments:

- `Data.Text.Lazy.IO.hGetContents` should certainly get that point right. – Hypoplasia
- `Map` has lots of overhead, I'd suspect (assuming x86-64) ~4 machine words per cell, an extra two for each `Text`, and finally 1–2 to store the words' characters themselves. Add some total overhead for garbage collection, and 30GB seems not unlikely for hundreds of millions of words. – Hypoplasia
- Regarding `Map`: if you don't expect many duplicates among the words, perhaps you should switch to something else completely. External merge sort (with duplicate-counting & nubbing as a separate step) is relatively simple. Of course, anything external is bound to involve lots of dirty IO; I'd wager it's actually easier to code this up in C or C++, which also give you much better control over the data structures' overhead. – Hypoplasia
- From the `unordered-containers` page: "The containers have been optimized for performance critical use, both in terms of large data quantities and high speed". – Biforate
- `-O2`, right? – Melissa
- `-O` and `-O2` :) – Biforate
- `bytestring-trie` – Eaglet
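On Hypoplasia's external merge sort suggestion: once the words are sorted, the duplicate-counting step is a single pass over equal runs. A minimal in-memory sketch of just that step (the on-disk sorting itself is left aside here, and `countSorted` is a name chosen for illustration):

```haskell
import Data.List (group, sort)

-- After sorting, equal words sit in contiguous runs, so counting
-- duplicates is one pass with group: each run becomes (word, run length).
countSorted :: [String] -> [(String, Int)]
countSorted ws = [ (head run, length run) | run <- group ws ]

main :: IO ()
main = print (countSorted (sort (words "to be or not to be")))
-- prints [("be",2),("not",1),("or",1),("to",2)]
```

The same `group`-over-sorted-runs idea works streaming when the sorted words arrive from disk, which is what keeps the external approach's memory flat regardless of vocabulary size.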