Clojure - process huge files with low memory

I am processing text files of 60 GB or larger. Each file is separated into a header section of variable length and a data section. I have three functions:

  • head?, a predicate to distinguish header lines from data lines
  • process-header, which processes one header line (a string)
  • process-data, which processes one data line (a string)
  • The processing functions asynchronously access and modify an in-memory database.

I adapted a file-reading method from another SO thread, which should build a lazy sequence of lines. The idea was to process some lines with one function, then switch functions once and keep processing with the other one.

(defn lazy-file
  [file-name]
  (letfn [(helper [rdr]
            (lazy-seq
             (if-let [line (.readLine rdr)]
               (cons line (helper rdr))
               (do (.close rdr) nil))))]
    (try
      (helper (clojure.java.io/reader file-name))
      (catch Exception e
        (println "Exception while trying to open file" file-name)))))

I use it with something like

(let [lfile (lazy-file "my-file.txt")]
  (doseq [line lfile :while (head? line)]
    (process-header line))
  (doseq [line (drop-while head? lfile)]
    (process-data line)))

Although that works, it's rather inefficient for a couple of reasons:

  • Instead of simply calling process-header until I reach the data and then continuing with process-data, I have to filter out the header lines and process them, then restart parsing the whole file and drop all the header lines again to process the data. This is the exact opposite of what lazy-file was intended to do.
  • Watching memory consumption shows that the program, though seemingly lazy, builds up to as much RAM as would be required to keep the whole file in memory.

So what would be a more efficient, idiomatic way to process the file and feed my database?

One idea might be to use a multimethod to process header and data lines depending on the value of the head? predicate, but I suppose this would have a serious speed impact, especially as there is only one point where the predicate's outcome changes from always true to always false. I haven't benchmarked that yet.
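
For illustration, this is roughly what I have in mind; process-line is just a made-up name, and the dispatch assumes head? returns an actual boolean (untested):

(defmulti process-line
  ;; dispatch on the outcome of head?; assumes head? returns an actual boolean
  (fn [line] (head? line)))

(defmethod process-line true  [line] (process-header line))
(defmethod process-line false [line] (process-data line))

(doseq [line (lazy-file "my-file.txt")]
  (process-line line))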

Would it be better to build the line seq in another way and parse it with iterate? This would still leave me needing :while and drop-while, I guess.

In my research, NIO file access was mentioned a couple of times as something that should improve memory usage. I have not yet figured out how to use it in an idiomatic way in Clojure.
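
The closest I have pieced together so far is something along these lines (untested), and I have no idea whether it actually helps with memory; Files/newBufferedReader returns a plain java.io.BufferedReader, so it should drop into lazy-file in place of clojure.java.io/reader:

(import '(java.nio.file Files Paths)
        '(java.nio.charset StandardCharsets))

(defn nio-reader
  "Open file-name via java.nio; returns a BufferedReader, like io/reader does."
  [file-name]
  (Files/newBufferedReader (Paths/get file-name (into-array String []))
                           StandardCharsets/UTF_8))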

Maybe I still have a bad grasp of the general idea of how the file should be treated?

As always, any help, ideas or pointers to tuts are greatly appreciated.

Abdias asked 17/12, 2015 at 8:25 Comment(0)

You should use standard library functions.

line-seq, with-open and doseq will easily do the job.

Something along the lines of:

(with-open [rdr (clojure.java.io/reader file-path)]
  (doseq [line (line-seq rdr)]
    (if (head? line)
      (process-header line)
      (process-data line))))
Applause answered 18/12, 2015 at 1:19 Comment(5)
Thanks for your suggestion. The lazy-file method I am using was implemented when I started learning Clojure, stowed away in an io module and used from there. Its net effect is truly the same as just using line-seq. – Abdias
As a side note, the per-line if-else approach proved to be significantly slower (a factor of 1.5) than the approach I was taking. Significant, because runtime here is measured in hours ;-) – Abdias
I understand your argument about lazy-file, but dealing with opening and closing the file makes that function harder to unit test. – Applause
The memory problem is that you hold the head of the lazy seq in your let binding. While you process the lines, they are therefore kept in memory, as the seq documentation points out. – Applause
About the if: if it's too costly because of the file size, your approach of opening the file twice is definitely a valid one. – Applause

There are several things to consider here:

  1. Memory usage

    There are reports that Leiningen might add things that result in keeping references to the head of a sequence, although doseq specifically does not hold on to the head of the sequence it is processing, cf. this SO question. Try verifying your claim that the program uses "as much RAM as would be required to keep the file in memory" without using lein repl.

  2. Parsing lines

    Instead of using two loops with doseq, you could also use a loop/recur approach. Whether you still expect to be parsing the header would then be a second loop argument, like this (untested):

        (loop [lfile (lazy-file "my-file.txt")
               parse-header true]
          (when-let [line (first lfile)]     ; stop when the file is exhausted
            (if (and parse-header (head? line))
              (do (process-header line)
                  (recur (rest lfile) true))
              (do (process-data line)
                  (recur (rest lfile) false)))))
    

    There is another option here, which would be to incorporate your processing functions into your file-reading function. So, instead of just consing a new line and returning it, you could just as well process it right away -- typically you would hand the processing function over as an argument instead of hard-coding it (a sketch of this is at the end of this point).

    In your current code, processing looks like a side effect. If so, you could probably do away with the laziness altogether once you incorporate the processing: you need to process the entire file anyway (or so it seems), and you do so on a per-line basis. The lazy-seq approach basically just aligns a single line read with a single processing call. Your need for laziness arises in the current solution because you separate reading (the entire file, line by line) from processing. If you instead move the processing of a line into the reading, you don't need to do that lazily.
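
    A minimal sketch of that idea, with the processing function handed in as an argument; process-file! and its parameter names are made up, and it assumes your handlers are plain side-effecting functions (untested):

        (defn process-file!
          [file-name process-line]                ; processing fn is passed in, not hard-coded
          (with-open [^java.io.BufferedReader rdr (clojure.java.io/reader file-name)]
            (loop []
              (when-let [line (.readLine rdr)]    ; read and process one line at a time
                (process-line line)
                (recur)))))

        ;; e.g. with the per-line decision folded into the handler:
        (process-file! "my-file.txt"
                       (fn [line]
                         (if (head? line)
                           (process-header line)
                           (process-data line))))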

Extrasensory answered 17/12, 2015 at 11:12 Comment(2)
Thanks for your answer. Yesterday I wrote some test cases to do benchmarking. It turned out that A) it's not the reading itself that consumes that much memory, it seems to be the database (btw, my memory-consumption claims stem from running the compiled application); B) lazy-file and line-seq perform roughly equally in terms of speed and memory usage; C) surprisingly, multimethods and a loop/recur approach take about 150% of the time needed to open the file twice and use :while/drop-while. – Abdias
I like your way of recursing while reading the file. The next idea I'll try is to have the header parser check whether the next line is a data line (iterator style) and, if so, trampoline away to the data parser. If-else on each line is really slow, but the files are well defined: a few hundred header lines versus hundreds of millions of data lines, and reading the header takes less than half a second. I'm just not sure yet how to combine trampoline and an iterator... – Abdias
