Parsing very large XML documents (and a bit more) in Java

(All of the following is to be written in Java)

I have to build an application that will take as input XML documents that are potentially very large. Each document is encrypted -- not with XMLsec, but with my client's preexisting encryption algorithm -- and will be processed in three phases:

First, the stream will be decrypted according to the aforementioned algorithm.

Second, an extension class (written by a third party to an API I am providing) will read some portion of the file. The amount that is read is not predictable -- in particular it is not guaranteed to be in the header of the file, but might occur at any point in the XML.

Lastly, another extension class (same deal) will subdivide the input XML into 1..n subset documents. It is possible that these will in part overlap the portion of the document dealt with by the second operation, i.e. I believe I will need to rewind whatever mechanism I am using to deal with this object.

Here is my question:

Is there a way to do this without ever reading the entire piece of data into memory at one time? Obviously I can implement the decryption as an input stream filter, but I'm not sure whether it's possible to parse XML in the way I'm describing: by walking over as much of the document as is required to gather the second step's information, then rewinding the document and passing over it again to split it into jobs, ideally releasing the parts of the document that are no longer in use once they have been passed on.

Lallans asked 10/12, 2008 at 12:41 Comment(0)
12

StAX is the right way. I would recommend looking at Woodstox.
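
A minimal sketch of wiring Woodstox in, assuming woodstox-core is on the classpath; the decrypted stream passed in is the hypothetical output of the question's decryption phase, not something Woodstox provides:

import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class WoodstoxSetup {
    public static XMLStreamReader openReader(InputStream decryptedXml) throws XMLStreamException {
        // Instantiating Woodstox's factory directly avoids relying on service-loader
        // discovery; XMLInputFactory.newInstance() would also find it if woodstox-core
        // is the only StAX implementation on the classpath.
        XMLInputFactory factory = new com.ctc.wstx.stax.WstxInputFactory();
        return factory.createXMLStreamReader(decryptedXml);
    }
}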

Debarath answered 10/12, 2008 at 13:41 Comment(0)
7

This sounds like a job for StAX (JSR 173). StAX is a pull parser, which means it works more or less like an event-based parser such as SAX, but gives you more control over when to stop reading, which elements to pull, and so on.

The usability of this solution will depend a lot on what your extension classes are actually doing, whether you have control over their implementation, and so on.

The main point is that if the document is very large, you probably want to use an event-based parser rather than a tree-based one, so you do not use a lot of memory.

Implementations of StAX are available from Sun (SJSXP), Codehaus, and a few other providers.
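
As a minimal sketch of the pull style (the element name "metadata" is a made-up stand-in for whatever your second phase needs to find), the reader below stops pulling as soon as the target element has been read, so the rest of the stream is never parsed:

import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class MetadataScanner {
    // Returns the text of the first <metadata> element, or null if none is found.
    public static String findMetadata(InputStream decryptedXml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(decryptedXml);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "metadata".equals(reader.getLocalName())) {
                    return reader.getElementText(); // stop pulling here; nothing after this is read
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}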

Highroad answered 10/12, 2008 at 13:24 Comment(2)
This looks promising, as long as I can hook in to it efficiently. It looks like I'll have to expose StAX to my API's clients, which is less than ideal, but at least it looks like the capabilities are there. Can you amend your post with a recommended implementation, instead of the list? -- Lallans
I know this is an old answer/comment, but there are some libs that can add a bit more convenience on top of StAX (and isolate some lower-level details), for example StaxMate [staxmate.codehaus.org/Tutorial]. This still allows for incremental parsing/writing, but reduces the amount of code to write. -- Trossachs
3

You could use a BufferedInputStream with a very large buffer size and use mark() before the extension class works and reset() afterwards.

If the parts the extension class needs are very far into the file, though, this might become extremely memory-intensive.

A more general solution would be to write your own BufferedInputStream-workalike that buffers to the disk if the data that is to be buffered exceeds some preset threshold.
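
A minimal sketch of the mark()/reset() approach from the first paragraph; the 64 MB limit and the two phase hooks are assumptions, and the question's decrypting filter would wrap the FileInputStream:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RewindableRead {
    private static final int MARK_LIMIT = 64 * 1024 * 1024; // assumed upper bound on what phase two may read

    public static void process(String path) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            in.mark(MARK_LIMIT);   // remember the current position (the buffer may grow up to MARK_LIMIT)
            runSecondPhase(in);    // extension class reads an unpredictable amount
            in.reset();            // rewind so the splitting phase sees the whole document again
            runSplitterPhase(in);
        }
    }

    private static void runSecondPhase(InputStream in) { /* hypothetical extension hook */ }
    private static void runSplitterPhase(InputStream in) { /* hypothetical extension hook */ }
}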

Cacology answered 10/12, 2008 at 12:59 Comment(0)
3

I would write a custom implementation of InputStream that decrypts the bytes in the file and then use SAX to parse the resulting XML as it comes off the stream.

SAXParserFactory.newInstance().newSAXParser().parse(
  new DecryptingInputStream(), 
  new MyHandler()
);
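
A skeleton of what such a stream could look like as a FilterInputStream; the no-argument constructor in the snippet above is replaced here by one taking the encrypted source, and decrypt() is a placeholder for the client's algorithm:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DecryptingInputStream extends FilterInputStream {

    public DecryptingInputStream(InputStream encrypted) {
        super(encrypted);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        return (b == -1) ? -1 : (decrypt((byte) b) & 0xFF);
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = 0; i < n; i++) {
            buf[off + i] = decrypt(buf[off + i]); // decrypt in place
        }
        return n;
    }

    private byte decrypt(byte b) {
        return b; // placeholder -- substitute the client's real decryption here
    }
}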
Hellkite answered 10/12, 2008 at 13:57 Comment(0)
1

You might be interested by XOM:

XOM is fairly unique in that it is a dual streaming/tree-based API. Individual nodes in the tree can be processed while the document is still being built. This enables XOM programs to operate almost as fast as the underlying parser can supply data. You don't need to wait for the document to be completely parsed before you can start working with it.

XOM is very memory efficient. If you read an entire document into memory, XOM uses as little memory as possible. More importantly, XOM allows you to filter documents as they're built so you don't have to build the parts of the tree you aren't interested in. For instance, you can skip building text nodes that only represent boundary white space, if such white space is not significant in your application. You can even process a document piece by piece and throw away each piece when you're done with it. XOM has been used to process documents that are gigabytes in size.

Linage answered 10/12, 2008 at 13:21 Comment(1)
That looks like an interesting and potentially useful approach, but nowhere in the documentation does it suggest a way to control the parsing of the document in the way you describe. I believe you that it can, but the capability is not documented in a way that is reasonable to discover. -- Lallans
0

Look at the XOM library. The example you are looking for is StreamingExampleExtractor.java in the samples directory of the source distribution. This shows a technique for performing a streaming parse of a large XML document, building only specific nodes, processing them, and discarding them. It is very similar to a SAX approach, but has a lot more parsing capability built in, so a streaming parse can be achieved pretty easily.
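
A condensed sketch of that pattern (not the actual StreamingExampleExtractor code), assuming the elements of interest are called "record" and "big.xml" stands in for the input file; each record is handed off as soon as it is complete and never attached to the tree, so memory use stays flat:

import java.io.FileInputStream;
import nu.xom.Builder;
import nu.xom.Element;
import nu.xom.NodeFactory;
import nu.xom.Nodes;

public class RecordExtractor extends NodeFactory {

    private final Nodes empty = new Nodes();

    @Override
    public Nodes finishMakingElement(Element element) {
        if ("record".equals(element.getLocalName())) {
            handle(element);   // the subtree is fully built at this point
            return empty;      // do not attach it, so it can be garbage collected
        }
        return super.finishMakingElement(element); // keep everything else, including the root
    }

    private void handle(Element record) {
        System.out.println(record.getValue());
    }

    public static void main(String[] args) throws Exception {
        new Builder(new RecordExtractor()).build(new FileInputStream("big.xml"));
    }
}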

If you want to work at a higher level, look at NUX. This provides a high-level streaming XPath API that reads into memory only the amount of data needed to evaluate the XPath.

Cowans answered 10/3, 2011 at 21:16 Comment(0)
