Reading Huge XML File using StAX and XPath

M

7

12

The input file contains thousands of transactions in XML format and is around 10GB in size. The requirement is to pick out each transaction's XML based on user input and send it to a processing system.

Sample content of the file:

<transactions>
    <txn id="1">
      <name> product 1</name>
      <price>29.99</price>
    </txn>

    <txn id="2">
      <name> product 2</name>
      <price>59.59</price>
    </txn>
</transactions>

The (technical) user is expected to supply the tag name as input, e.g. <txn>.

We would like to make this solution more generic. The file content might differ, and users could supply an XPath expression like "//transactions/txn" to pick individual transactions.

There are a few technical things we have to consider here:

The file can be in a shared location or on an FTP server.
Since the file is huge, we can't load the entire file into the JVM.

Can we use a StAX parser for this scenario? It has to take an XPath expression as input and pick/select the transaction XML.

Looking for suggestions. Thanks in advance.

Morly answered 27/8, 2011 at 16:49 Comment(1)

My recommendation is to use extended vtd-xml in mem map mode and 64 bit jvm – Illusive 5/8, 2013 at 5:47

S

16

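StAX and XPath are very different things. StAX lets you parse a streaming XML document in a forward direction only. XPath is a query language whose axes can navigate in both directions (ancestors and preceding nodes as well as descendants), so a full implementation generally needs the document in memory. StAX is a very fast streaming XML parser, but if you want XPath, Java has a separate library for that (javax.xml.xpath).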

Take a look at this question for a very similar discussion: Is there any XPath processor for SAX model?

Smokeproof answered 27/8, 2011 at 17:2 Comment(2)

If you're going to downvote me, please leave a comment. That way everyone learns! – Smokeproof 5/9, 2012 at 16:5

Downvoting because your statement "Stax and xpath are very different things" is not correct. XPath (or at least a subset of it) can still be implemented in the StAX model (pull model). It's implemented in C#: msdn.microsoft.com/en-us/library/ms950778.aspx – Testosterone 10/7, 2016 at 22:19
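
The subset idea in the comment above can be sketched in Java. This is a minimal illustration, not a real XPath engine: it assumes paths made only of child steps, optionally starting with "//", and the class name SimplePathMatcher and the choice to collect the "id" attribute are hypothetical. It keeps a stack of open element names while pulling events forward, so memory use depends on nesting depth rather than file size.

```java
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper, not part of any library. */
final class SimplePathMatcher {

    /**
     * Collects the "id" attribute of each element matched by a child-step-only
     * path such as "/transactions/txn" or "//transactions/txn".
     */
    static List<String> matchIds(Reader input, String path) {
        // "//a/b" matches any element whose path ends in "/a/b";
        // "/a/b" must match starting from the document root.
        boolean anywhere = path.startsWith("//");
        String steps = anywhere ? path.substring(1) : path;

        List<String> ids = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // names of currently open elements
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(input);
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    open.addLast(reader.getLocalName());
                    String current = "/" + String.join("/", open);
                    if (anywhere ? current.endsWith(steps) : current.equals(steps)) {
                        // May be null when the element has no "id" attribute.
                        ids.add(reader.getAttributeValue(null, "id"));
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    open.removeLast();
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML input", e);
        }
        return ids;
    }
}
```

Predicates, reverse axes, and functions are deliberately left out; those are the parts of XPath that are hard to evaluate in a single forward pass.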

T

18

If performance is an important factor, and/or the document size is large (both of which seem to be the case here), the difference between an event parser (like SAX or StAX) and the native Java XPath implementation is that the latter builds a W3C DOM Document prior to evaluating the XPath expression. [It's interesting to note that all Java Document Object Model implementations like the DOM or Axiom use an event processor (like SAX or StAX) to build the in-memory representation, so if you can ever get by with only the event processor you're saving both memory and the time it takes to build a DOM.]

As I mentioned, the XPath implementation in the JDK operates upon a W3C DOM Document. You can see this in the Java JDK source code implementation by looking at com.sun.org.apache.xpath.internal.jaxp.XPathImpl, where prior to the evaluate() method being called the parser must first parse the source:

  Document document = getParser().parse( source );

After this, your 10GB of XML will be represented in memory (plus whatever overhead), which is probably not what you want. While you may want a more "generic" solution, both your example XPath and your XML markup seem relatively simple, so there doesn't seem to be a strong justification for XPath (except perhaps programming elegance). The same would be true for the XProc suggestion: this would also build a DOM.

If you truly need a DOM, you could use Axiom rather than the W3C DOM. Axiom has a much friendlier API and builds its object model over StAX, so it's fast, and it uses Jaxen for its XPath implementation. Jaxen requires some kind of in-memory object model (W3C DOM, DOM4J, or JDOM). The same is true of most full XPath implementations, so if you don't truly need XPath, sticking with just the event parser would be recommended.
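
For reference, here is a small sketch of the standard javax.xml.xpath API being handed an InputSource directly. The class name DomXPathDemo is hypothetical. The call is convenient for small documents, but as described above the whole input is parsed into a DOM before the expression is evaluated, so it is not suitable for the 10GB file.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical demo class: JDK XPath evaluated over a small in-memory document. */
final class DomXPathDemo {

    /** Returns how many nodes the expression selects in the given XML text. */
    static int countMatches(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // The complete source is parsed into a DOM Document before evaluation.
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```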

SAX is the older streaming API; StAX is newer and a great deal faster. Using either the native JDK StAX implementation (javax.xml.stream) or the Woodstox StAX implementation (which is significantly faster, in my experience), I'd recommend creating an XML event filter that first matches on element type name (to capture your <txn> elements). This will create small bursts of events (element, attribute, text) that can be checked for your matching user values. Upon a suitable match you could either pull the necessary information from the events or pipe the bounded events into a mini-DOM if you found the result easier to navigate. But it sounds like that might be overkill if the markup is simple.

This would likely be the simplest, fastest possible approach and avoid the memory overhead of building a DOM. If you passed the names of the element and attribute to the filter (so that your matching algorithm is configurable) you could make it relatively generic.
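
A minimal sketch of such a filter, using the JDK's javax.xml.stream event API. The class name TxnFragmentReader and the callback-based design are assumptions for illustration; the element name is passed in so the matching stays configurable. Each matching element is copied, event by event, into its own small string and handed to the caller, so only one transaction is held in memory at a time.

```java
import java.io.Reader;
import java.io.StringWriter;
import java.util.function.Consumer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical helper, not part of any library. */
final class TxnFragmentReader {

    /** Streams the input and passes each element named elementName to sink as XML text. */
    static void forEachFragment(Reader input, String elementName, Consumer<String> sink) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(input);
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && elementName.equals(event.asStartElement().getName().getLocalPart())) {
                    sink.accept(copyElement(event, reader, outputFactory));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML input", e);
        }
    }

    /** Copies events from the given start element through its matching end element. */
    private static String copyElement(XMLEvent start, XMLEventReader reader,
                                      XMLOutputFactory outputFactory) throws XMLStreamException {
        StringWriter buffer = new StringWriter();
        XMLEventWriter writer = outputFactory.createXMLEventWriter(buffer);
        writer.add(start);
        int depth = 1; // number of elements opened inside the fragment and not yet closed
        while (depth > 0) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                depth--;
            }
            writer.add(event);
        }
        writer.flush();
        writer.close();
        return buffer.toString();
    }
}
```

For the real file, the Reader would wrap the file or FTP stream, and the sink could be something like fragment -> processingSystem.send(fragment), where processingSystem stands in for whatever receives the transactions.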

Tumescent answered 3/4, 2013 at 10:26 Comment(2)

Have you heard of vtd-xml? – Illusive 19/7, 2013 at 1:37

Not until your comment, no. I've downloaded the distribution and will be happy to try it out. If it performs as claimed I'd consider using it in production environments, but one hitch inclines me to ask (since you're its author): would you be willing to also release vtd-xml under an LGPL or Apache license? We simply can't use GPL in our environment. Thanks for the tip in any case. – Tumescent 1/8, 2013 at 10:5

C

1

It's definitely a use case for XProc with a streaming and parallel processing implementation like QuiXProc (http://code.google.com/p/quixproc)

In this situation, you will have to use

  <p:for-each>
    <p:iteration-source select="//transactions/txn"/>
    <!-- your processing on a small file -->
  </p:for-each>

You can even wrap each of the resulting transformations with a single line of XProc

  <p:wrap-sequence wrapper="transactions"/>

Hope this helps

Camillecamilo answered 3/9, 2011 at 7:4 Comment(0)

B

1

We regularly parse 1GB+ complex XML files by using a SAX parser which does exactly what you described: It extracts partial DOM trees that can be conveniently queried using XPATH.

I blogged about it here. It uses a SAX rather than a StAX parser, but may be worth a look.
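
Since the blog's source is not reproduced here, the following is only a sketch of the general idea (stream to the element of interest, then run XPath on that small piece), written against StAX rather than SAX. The class name FragmentXPath is hypothetical, and re-parsing each serialized fragment is a simplification chosen for clarity, not the blog author's method.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical helper: XPath over one small fragment at a time. */
final class FragmentXPath {

    /**
     * Evaluates expression against each element named elementName, treating that
     * element as the document root, and returns the non-empty string results.
     */
    static List<String> select(Reader input, String elementName, String expression) {
        List<String> results = new ArrayList<>();
        try {
            XPathExpression compiled = XPathFactory.newInstance().newXPath().compile(expression);
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(input);
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && elementName.equals(event.asStartElement().getName().getLocalPart())) {
                    String fragment = serialize(event, reader, outputFactory);
                    // Only this fragment is turned into a DOM, never the whole file.
                    String value = compiled.evaluate(new InputSource(new StringReader(fragment)));
                    if (!value.isEmpty()) {
                        results.add(value);
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException | XPathExpressionException e) {
            throw new IllegalStateException("Could not process XML input", e);
        }
        return results;
    }

    /** Writes the start element and everything up to its matching end element as text. */
    private static String serialize(XMLEvent start, XMLEventReader reader,
                                    XMLOutputFactory outputFactory) throws XMLStreamException {
        StringWriter buffer = new StringWriter();
        XMLEventWriter writer = outputFactory.createXMLEventWriter(buffer);
        writer.add(start);
        int depth = 1;
        while (depth > 0) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                depth--;
            }
            writer.add(event);
        }
        writer.flush();
        writer.close();
        return buffer.toString();
    }
}
```

With the sample file from the question, select(reader, "txn", "normalize-space(/txn[price > 50]/name)") would pick the name of each transaction priced above 50.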

Broderickbrodeur answered 7/1, 2012 at 15:36 Comment(1)

Sounds interesting, but the source code for the blogpost appears to be no longer in Github. – Willis 1/5 at 6:25

N

1

A fun solution for processing huge XML files (>10GB):

Use ANTLR to create byte offsets for the parts of interest. This saves some memory compared with a DOM-based approach.
Use JAXB to read the parts from their byte positions.

Details, using Wikipedia dumps (17GB) as the example, are in this SO answer: https://mcmap.net/q/909813/-using-stax-to-create-index-for-xml-for-quick-access
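
The second step, reading one part from a known byte position, might look roughly like this. It is a sketch under assumptions: the offset and length are taken as given (building the index with ANTLR is not shown), the file is assumed to be UTF-8, and the class name ByteRangeReader is hypothetical. The returned text could then be handed to JAXB or any other parser.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/** Hypothetical helper: reads one indexed fragment without scanning the whole file. */
final class ByteRangeReader {

    /** Returns the bytes [offset, offset + length) of the file decoded as UTF-8 text. */
    static String readFragment(Path file, long offset, int length) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);            // jump straight to the indexed position
            byte[] buffer = new byte[length];
            raf.readFully(buffer);       // fails with EOFException if the range is too long
            return new String(buffer, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```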

Naos answered 26/2, 2018 at 9:43 Comment(0)

L

0

Streaming Transformations for XML (STX) might be what you need.

Landtag answered 27/8, 2011 at 17:21 Comment(0)

F

0

Do you need to process it fast, or do you need fast lookups in the data? These requirements call for different approaches.

For fast reading of the whole data, StAX will be OK.

If you need fast lookups, then you may need to load it into a database, e.g. Berkeley DB XML.

Fishbein answered 27/8, 2011 at 19:28 Comment(0)
