I have a JSON file containing quite a lot of test data, which I want to parse and push through an algorithm I'm testing. It's about 30MB in size, with a list of 60,000 or so elements. I initially tried the simple parser in scala.util.parsing.json, like so:
import scala.io.Source
import scala.util.parsing.json.JSON
val data = JSON.parseFull(Source.fromFile(path).mkString)
Where path is just a string containing the path to the big JSON file. That chugged away for about 45 minutes, then threw this:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Someone then pointed out to me that nobody uses this library and that I should use Lift's JSON parser instead, so I tried this in my Scala REPL:
scala> import scala.io.Source
import scala.io.Source
scala> val s = Source.fromFile("path/to/big.json")
s: scala.io.BufferedSource = non-empty iterator
scala> import net.liftweb.json._
import net.liftweb.json._
scala> val data = parse(s mkString)
java.lang.OutOfMemoryError: GC overhead limit exceeded
This time it only took about three minutes, but it ended in the same error.
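For context, the amount of heap the REPL actually gets can be checked with the standard java.lang.Runtime API (nothing specific to either JSON library), in case the answer turns out to be "just give the JVM more memory":

// Report the JVM's configured maximum heap, to see how much room
// the parser had before hitting the GC overhead error.
val maxHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
println("max heap: " + maxHeapMb + " MB")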
So, obviously I could break the file up into smaller ones, iterate over the directory of JSON files and merge my data together piece-by-piece, but I'd rather avoid it if possible. Does anyone have any recommendations?
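To make that fallback concrete, here's roughly what I mean by splitting and merging, sketched with lift-json (the path/to/chunks directory is made up, and I'm assuming each chunk file holds a single JSON array of records):

import java.io.File
import scala.io.Source
import net.liftweb.json.JsonAST.{JArray, JValue}
import net.liftweb.json.JsonParser.parse

// Parse each smaller chunk file separately and merge the pieces.
val chunkDir = new File("path/to/chunks")
val merged: List[JValue] =
  chunkDir.listFiles.toList.filter(_.getName.endsWith(".json")).flatMap { f =>
    val src = Source.fromFile(f)
    try {
      parse(src.mkString) match {
        case JArray(elems) => elems        // chunk holds a JSON array of records
        case other         => List(other)  // not an array; keep it as one value
      }
    } finally {
      src.close()
    }
  }

Even then, only one chunk's text is in memory at a time, but the merged AST for all 60,000 or so records still has to fit on the heap at once.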
For further information -- I'd been working with this same dataset for the past few weeks in Clojure (for visualization with Incanter) without issues. The following works perfectly fine:
user=> (use 'clojure.data.json)
nil
user=> (use 'clojure.java.io)
nil
user=> (time (def data (read-json (reader "path/to/big.json"))))
"Elapsed time: 19401.629685 msecs"
#'user/data
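For what it's worth, the Scala timings above are rough wall-clock estimates; a small helper along these lines (call it timed, a made-up name here, using plain System.nanoTime, analogous to Clojure's (time ...)) would give a directly comparable number if one of the Scala parses ever completes:

// Wall-clock a block of code, roughly like Clojure's (time ...).
def timed[A](body: => A): A = {
  val start = System.nanoTime
  val result = body
  println("Elapsed time: " + (System.nanoTime - start) / 1e6 + " msecs")
  result
}

// e.g. timed(JSON.parseFull(Source.fromFile(path).mkString))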