Parsing (very) large XML files with XmlSlurper

I am fairly new to Groovy and I am trying to read a large XML file (more than 1 GB) using XmlSlurper, which is supposed to handle large files well because it doesn't build the whole DOM in memory.

Nevertheless, I keep getting "OutOfMemoryError: Java heap space", which makes me think I am obviously doing something wrong. I tried increasing the Xmx setting, but I would rather solve the underlying problem, since I may have to deal with even bigger files later.

Here is the line of code I used:

def posts = new XmlSlurper().parse(new File("posts.xml"))

Any hint on what's wrong?

Thanks in advance,

Jérémie.

Flavour answered 2/4, 2012 at 13:31 Comment(1)
This question is similar: #4104764 - Mathews

Groovy's XmlSlurper uses a SAX parser under the hood, but it still builds the entire document model in memory before returning it.

To avoid OutOfMemoryError, you probably need to either raise your memory allowance (as you say, using the -Xmx setting) or write your own SAX handler that extracts just the data you require from the document as it is parsed.
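As a rough illustration of the second option, here is a minimal sketch of a hand-written SAX handler in Groovy. It assumes, purely for the example, that each record is a `row` element carrying its data in `Id` and `Title` attributes; the real structure of posts.xml is not shown in the question, so those names are hypothetical. Each row is passed to a closure and then forgotten, so memory use should not grow with the size of the file.

```groovy
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

// Hypothetical handler: the element name 'row' and the attribute names
// 'Id' and 'Title' are assumptions made for this example only.
class RowHandler extends DefaultHandler {
    final Closure onRow

    RowHandler(Closure onRow) {
        this.onRow = onRow
    }

    @Override
    void startElement(String uri, String localName, String qName, Attributes attrs) {
        // The default SAXParserFactory is not namespace-aware, so match on qName.
        if (qName == 'row') {
            // Hand the values to the callback; nothing is kept in the handler.
            onRow.call(attrs.getValue('Id'), attrs.getValue('Title'))
        }
    }
}

// Small in-memory sample so the sketch runs on its own.
// For a real file, pass new File('posts.xml') to parse() instead of the stream.
def sample = '<posts><row Id="1" Title="First"/><row Id="2" Title="Second"/></posts>'
def handler = new RowHandler({ id, title -> println "${id}: ${title}" })
def parser = SAXParserFactory.newInstance().newSAXParser()
parser.parse(new ByteArrayInputStream(sample.getBytes('UTF-8')), handler)
```

The callback keeps the parsing code separate from whatever you do with each record (write to a database, count, filter), which is the main design choice here.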

Greige answered 2/4, 2012 at 13:45 Comment(2)
Well, that explains it. Thanks! - Ritenuto
@Greige Does Groovy have a decent way to read large XML files on a pure path basis without loading the whole model into memory? - Fallal

I'm a bit late to this party, but I've been having the same issue.

I posted a proposal to the groovy-user mailing list to add something like Perl's XML::Twig module to XmlSlurper.

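// XPathXmlSlurper2 is the class proposed in the mailing-list post; it is not part of Groovy.
def xpathSlurper = new XPathXmlSlurper2()
// Handler called for each element matching the path: print its text, then purge it.
def c = { twig, node ->
    println node.text().trim()
    twig.purgeCurrent()
}
// xpath holds the path expression selecting the elements to handle (not defined in this snippet).
xpathSlurper.setTwigRootHandler(xpath, c)
def fdata = xpathSlurper.parse(new File("test.xml"))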

I've attached the sample code here: http://groovy.329449.n5.nabble.com/first-step-toward-Xml-Twig-for-Groovy-groovy-util-XPathXmlSlurper2-groovy-td4923577.html
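For readers who cannot use the proposed class, the same idea can be approximated with the standard SAX API alone. The sketch below is not the XPathXmlSlurper2 implementation from the post; `PathTextHandler` is a hypothetical helper that only understands a plain slash-separated absolute path (no real XPath), calls a closure with the trimmed text of each matching element, and then drops its buffer.

```groovy
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper for illustration; supports only paths such as '/root/item'.
class PathTextHandler extends DefaultHandler {
    final String targetPath
    final Closure onMatch
    private final List<String> stack = []   // names of the currently open elements
    private StringBuilder text = null       // non-null only inside a matching element

    PathTextHandler(String targetPath, Closure onMatch) {
        this.targetPath = targetPath
        this.onMatch = onMatch
    }

    @Override
    void startElement(String uri, String localName, String qName, Attributes attrs) {
        stack << qName
        if ('/' + stack.join('/') == targetPath) {
            text = new StringBuilder()
        }
    }

    @Override
    void characters(char[] ch, int start, int length) {
        // Collect text (including text of nested children) while inside a match.
        if (text != null) {
            text.append(ch, start, length)
        }
    }

    @Override
    void endElement(String uri, String localName, String qName) {
        if (text != null && '/' + stack.join('/') == targetPath) {
            onMatch.call(text.toString().trim())
            text = null   // discard the buffer so nothing accumulates between matches
        }
        stack.removeAt(stack.size() - 1)
    }
}

// Small in-memory sample; for a real file, pass new File('test.xml') to parse().
def xml = '<root><item> a </item><other>x</other><item>b</item></root>'
def handler = new PathTextHandler('/root/item', { println it })
def parser = SAXParserFactory.newInstance().newSAXParser()
parser.parse(new ByteArrayInputStream(xml.getBytes('UTF-8')), handler)
```

Only the text of the element currently being matched is held at any time, which is roughly the effect the purge call in the twig approach is after.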

I hope this helps!

Astro answered 5/4, 2012 at 12:58 Comment(1)
For now I have solved my problem by writing my own SAX parser, as tim_yates suggested, but since I am bound to deal with similar (and probably bigger) quantities of data in the future, I'd be glad to have something like that. Thanks for pointing it out! - Ritenuto
