Parsing a large (30MB) JSON file with net.liftweb.json or scala.util.parsing.json gives OutOfMemoryError. Any recommendations?
I have a JSON file containing quite a lot of test data, which I want to parse and push through an algorithm I'm testing. It's about 30MB in size, with a list of 60,000 or so elements. I initially tried the simple parser in scala.util.parsing.json, like so:

import scala.io.Source
import scala.util.parsing.json.JSON

val data = JSON.parseFull(Source.fromFile(path).mkString)

Where path is just a string containing the path to the big JSON file. That chugged away for about 45 minutes, then threw this:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Someone then pointed out to me that nobody uses this library and I should use Lift's JSON parser. So I tried this in my Scala REPL:

scala> import scala.io.Source
import scala.io.Source

scala> val s = Source.fromFile("path/to/big.json")
s: scala.io.BufferedSource = non-empty iterator

scala> import net.liftweb.json._
import net.liftweb.json._

scala> val data = parse(s.mkString)
java.lang.OutOfMemoryError: GC overhead limit exceeded

This time it only took about 3 minutes, but the same error.

So, obviously I could break the file up into smaller ones, iterate over the directory of JSON files and merge my data together piece-by-piece, but I'd rather avoid it if possible. Does anyone have any recommendations?
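One middle ground between splitting the file and slurping it whole is to stream the elements instead of materializing the entire document. As a minimal sketch (assuming the 60,000 elements could be re-exported one JSON object per line, JSON Lines style; the `processLines` helper and the per-element callback are hypothetical names, not part of any of the libraries above), only one element needs to be in memory at a time:

```scala
import scala.io.Source

// Sketch: stream a file one line at a time, handing each non-empty line
// (assumed to hold one complete JSON object) to a caller-supplied handler.
// Peak memory stays near the size of a single element, not the whole file.
def processLines(path: String)(handle: String => Unit): Int = {
  val src = Source.fromFile(path)
  try {
    var count = 0
    for (line <- src.getLines() if line.trim.nonEmpty) {
      handle(line) // parse the single element here, e.g. with Lift's parse()
      count += 1
    }
    count
  } finally src.close() // always release the file handle
}
```

Each call to `handle` could parse just that one element and fold it into whatever aggregate the algorithm needs, so the fully parsed 30MB structure never exists at once.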

For further information -- I'd been working with this same dataset the past few weeks in Clojure (for visualization with Incanter) without issues. The following works perfectly fine:

user=> (use 'clojure.data.json)
nil
user=> (use 'clojure.java.io)
nil

user=> (time (def data (read-json (reader "path/to/big.json"))))
"Elapsed time: 19401.629685 msecs"
#'user/data
Hurless answered 17/1, 2012 at 16:42

That message indicates that the application is spending more than 98% of its time collecting garbage.

I'd suspect that Scala is generating a lot of short-lived objects, which is what is causing the excessive GCs. You can verify the GC performance by adding the -verbosegc command line switch to java.
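Besides -verbosegc, the JVM's standard management beans expose per-collector counts and cumulative pause times from inside the process. A small sketch (using Scala 2.13's CollectionConverters; older Scala versions would use JavaConverters instead, and the `gcStats` name is just for illustration):

```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

// Sketch: query each garbage collector's name, collection count, and
// cumulative collection time (ms) via the standard JMX management beans.
def gcStats(): Seq[(String, Long, Long)] =
  ManagementFactory.getGarbageCollectorMXBeans.asScala.toSeq
    .map(b => (b.getName, b.getCollectionCount, b.getCollectionTime))
```

Sampling this before and after the parse makes it easy to see whether collection time, rather than useful work, dominates the 45 minutes.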

The default max heap size on Java 1.5+ server VM is 1 GB (or 1/4 of installed memory, whichever is less), which should be sufficient for your purposes, but you may want to increase the new generation to see if that improves your performance. On the Oracle VM, this is done with the -Xmn option. Try setting the following environment variable:

export JAVA_OPTS="-server -Xmx1024m -Xms1024m -Xmn2m -verbosegc -XX:+PrintGCDetails"

and re-running your application.

You should also check out this tuning guide for details.

Sanburn answered 17/1, 2012 at 17:47
I ran: $ JAVA_OPTS="-Xmx1024m -Xms1024m -Xmn2m" scala -classpath lift-json_2.9.0-1-2.4.jar:paranamer-2.1.jar. It took longer, but this time I got java.lang.OutOfMemoryError: Java heap space. I'll also try boosting the heap space and post back. Thanks for the comprehensive answer in any case. Useful links! – Hurless
OK! With the heap size at 2 GB and the other options as you proposed, it works. It takes about 10 minutes to parse the file and uses a lot of memory, which is a bit annoying, but that's acceptable for this automated performance test. Thanks for your help! – Hurless

Try using Jerkson instead. Jerkson uses Jackson underneath, which repeatedly scores as the fastest and most memory-efficient JSON parser on the JVM.

I've used both Lift JSON and Jerkson in production, and Jerkson's performance was significantly better than Lift's (especially when parsing and generating large JSON documents).

Landmass answered 17/2, 2012 at 1:42
The Jerkson project has since been abandoned. See this post for a great list of alternatives: engineering.ooyala.com/blog/comparing-scala-json-libraries – Aspidistra