why is sax parsing faster than dom parsing ? and how does stax work?
Asked Answered
S

4

11

somewhat related to: libxml2 from java

yes, this question is rather long-winded - sorry. I kept is as dense as I felt possible. I bolded the questions to make it easier to peek at before reading the whole thing.

Why is sax parsing faster than dom parsing? The only thing I can come up with is that w/ sax you're probably ignoring the majority of the incoming data, and thus not wasting time processing parts of the xml you don't care about. IOW - after parsing w/ SAX, you can't recreate the original input. If you wrote your SAX parser so that it accounted for each and every xml node (and could thus recreate the original), then it wouldn't be any faster than DOM would it?

The reason I'm asking is that I'm trying to parse xml documents more quickly. I need to have access to the entire xml tree AFTER parsing. I am writing a platform for 3rd party services to plug into, so I can't anticipate what parts of the xml document will be needed and which parts won't. I don't even know the structure of the incoming document. This is why I can't use jaxb or sax. Memory footprint isn't an issue for me because the xml documents are small and I only need 1 in memory at a time. It's the time it takes to parse this relatively small xml document that is killing me. I haven't used stax before, but perhaps I need to investigate further because it might be the middle ground? If I understand correctly, stax keeps the original xml structure and processes the parts that I ask for on demand? In this way, the original parse time might be quick, but each time I ask it to traverse part of the tree it hasn't yet traversed, that's when the processing takes place?

If you provide a link that answers most of the questions, I will accept your answer (you don't have to directly answer my questions if they're already answered elsewhere).

update: I rewrote it in sax and it parses documents on avg 2.1 ms. This is an improvement (16% faster) over the 2.5 ms that dom was taking, however it is not the magnitude that I (et al) would've guessed

Thanks

Stasiastasis answered 29/9, 2010 at 19:38 Comment(9)
I'd say the question of which is faster is irrelevant for your purposes, because you need to make arbitrary queries against the tree. Which means that you have to build some representation of the tree, and have some way to create queries against it. So either you use DOM/XPath, or you write your own equivalents.Ala
I suspect, however, that your real issue is not SAX vs DOM per se, but how your system is configured and/or how you're accessing the data. It really shouldn't take that long to parse a "small" document using DOM (or one of the DOM equivalents). Have you quantified the difference (that you're seeing) between SAX and DOM?Ala
I have quantified the DOM approach. small (approx. 300k) xml documents. The current implementation is using xerces-j and it takes about 2.5 ms per xml document on a 1.5 GHz machine. to quantify sax is somewhat dependent on how much of the xml you choose to keep around and what you do with it. you're right - I don't think sax will work for me - the question was more out of curiosity.Stasiastasis
2.5 ms really doesn't seem that bad. If you're just looking to satisfy curiosity, I'd suggest the following comparison programs: (1) read the file using an InputStreamReader that does a UTF-8 conversion, and (2) parse the document via SAX, using an empty DefaultHandler (ie, let it parse and dispatch, but don't do anything with the results).Ala
That said, garbage collection can be an issue if you're pushing a lot of documents through DOM: they tend to stick around long enough to get into the tenured generation.Ala
BTW - assuming you're using the Sun JVM, its internal parser is Xerces. Unless you need to work around a bug, I wouldn't go to the effort of explicitly using a Xerces distribution.Ala
You are getting down votes for being rude. No one here has any obligation to help you. Stop being arrogant towards those who are offering you help.Chappell
@anderson: It would increase legibility if you would use the canonical capitalization for DOM/SAX/StAX... Normally it wouldn't matter much, but with SAX/StAX having such potential for confusion, every little bit helps.Sterigma
Simple explanation of the differences between Sax and Dom javarevisited.blogspot.in/2011/12/…Careen
M
15

Assuming you do nothing but parse the document, the ranking of the different parser standards is as follows:

1. StAX is the fastest

  • The event is reported to you

2. SAX is next

  • It does everything StAX does plus the content is realized automatically (element name, namespace, attributes, ...)

3. DOM is last

  • It does everything SAX does and presents the information as an instance of Node.

Your Use Case

  • If you need to maintain all of the XML, DOM is the standard representation. It integrates cleanly with XSLT transforms (javax.xml.transform), XPath (javax.xml.xpath), and schema validation (javax.xml.validation) APIs. However if performance is key, you may be able to build your own tree structure using StAX faster than a DOM parser could build a DOM.
Maggoty answered 29/9, 2010 at 20:3 Comment(4)
Um, what do you think happens when "The event is reported to you" versus "the content is realized automatically"?Ala
StAX will report that element is started, if you never ask for the element name or URI then that data need not ever be realized as String objects. On the other hand a SAX parser will realize that data as String objects as part of the event.Maggoty
Perhaps. And if you tell me that you've looked at the internals of StaX and it's built around a character-based state machine, I'll believe you. However, I would expect it to generate tokens internally, even if you never ask for them.Ala
I have never looked at the internals of a StAX parser. If you only consider the namespace URI aspect. A realized token could be "foo:bar". A SAX parser will resolve the "foo" prefix to a namespace URI and report it, a StAX parser does not need to do that work regarding a token, therefore perform faster.Maggoty
W
12

DOM parsing requires you to load the entire document into memory and then traverse a tree to find the information you want.

SAX only requires as much memory as you need to do basic IO, and you can extract the information that you need as the document is being read. Because SAX is stream oriented, you can even process a file which is still being written by another process.

Wellestablished answered 29/9, 2010 at 19:42 Comment(7)
yes, I understand that. My question was "why is sax parsing faster?" not "what's the difference between sax and dom?"Stasiastasis
@Stargazer712 - mikerobi's answer did not address my question. I doubt he/she even read the question. It is a rote answer to any dom/sax question. I have an open mind if someone would put in the time to provide a thoughtful answer.Stasiastasis
@andersonbd1, I ready your question, I'm sorry you didn't understand my answer. To me it is pretty obvious that a process that requires more memory, and will not give you access to the data until it is completely parsed will be slower than a process that requires very little memory and allows you to access data almost as fast as it can be read.Wellestablished
@andersonbd1, reading your question it seemed to me a couple of times that you really didn't understand the difference between SAX and DOM, and that this was a factor in your not knowing the answer to your question. Given that, mikerobi's answer is justified. You may indeed understand the difference, but your question doesn't make that clear. Alleging that he didn't read the question or gave a thoughtless answer is unnecessary.Sterigma
@Wellestablished - my apologies for alleging that you didn't read the question. I got a bit defensive because I was charged with not having an "open mind". But enough w/ the drama - it doesn't seem to me that requiring memory should take 2.5 ms. I do a lot more memory intensive operations in a lot less time.Stasiastasis
@andersonbd1, I still don't think you understand. Reading something into memory and then using it has complexity of roughly O(2*n). Reading something and using it as you read it has complexity of roughly O(n). It is not about memory requirements, it is about the complexity of the algorithm.Chappell
Isn't O(2 * n) = O(n)? 🤨Polyploid
L
12

SAX is faster because DOM parsers often use a SAX parser to parse a document internally, then do the extra work of creating and manipulating objects to represent each and every node, even if the application doesn't care about them.

An application that uses SAX directly is likely to utilize the information set more efficiently than a DOM "parser" does.

StAX is a happy medium where an application gets a more convenient API than SAX's event-driven approach, yet doesn't suffer the inefficiency of creating a complete DOM.

Lear answered 29/9, 2010 at 19:42 Comment(0)
C
2

SAX is faster than DOM (usually felt when reading large XML document) because SAX gives you information as a sequence of events (usually accessed through a handler) while DOM creates Nodes and manages the node creation structure until a DOM tree is fully created (as represented in the XML document).

For relatively small files, you won't feel the effect (except that possibly that extra processing is done by DOM to create Node element and/or Node lists).

I can't really comment on StAX since I've never played with it.

Cusack answered 29/9, 2010 at 20:2 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.