When should I choose SAX over StAX?
C

6

87

Streaming XML parsers like SAX and StAX are faster and more memory-efficient than parsers that build a tree structure, such as DOM parsers. SAX is a push parser, meaning that it's an instance of the observer pattern (also called the listener pattern). SAX was there first, but then came StAX, a pull parser, meaning that it basically works like an iterator.

You can find reasons why to prefer StAX over SAX everywhere, but it usually boils down to: "it's easier to use".

In the Java tutorial on JAXP, StAX is vaguely presented as the middle ground between DOM and SAX: "it's easier than SAX and more efficient than DOM". However, I never found any indication that StAX would be slower or less memory-efficient than SAX.

All this made me wonder: are there any reasons to choose SAX instead of StAX?

Clausen answered 22/9, 2011 at 21:36 Comment(0)
A
25

To generalize a bit, I think StAX can be as efficient as SAX. With the improved design of StAX I can't really find any situation where SAX parsing would be preferred, unless working with legacy code.

EDIT: According to this blog, Java SAX vs. StAX, StAX offers no schema validation.
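
For what it's worth, validation can still be layered on top of a StAX reader with the standard javax.xml.validation API, which accepts a StAXSource. The following is only a sketch under assumptions: the class name StaxValidationSketch and the one-element schema are made up for illustration, and the whole document is validated in one pass rather than while your own code pulls events.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class StaxValidationSketch {

    // Hypothetical schema, used only for this illustration.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='Building' type='xs:int'/>"
      + "</xs:schema>";

    /** Returns true if the document validates against XSD, false otherwise. */
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();

            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            try {
                // The validator pulls the whole document through the StAX reader.
                validator.validate(new StAXSource(reader));
                return true;
            } catch (SAXException e) {
                return false; // the default error handler throws on validation errors
            } finally {
                reader.close();
            }
        } catch (SAXException | XMLStreamException | java.io.IOException e) {
            throw new IllegalStateException("could not set up validation", e);
        }
    }
}
```

With this schema, isValid("<Building>4</Building>") is expected to succeed, while a non-numeric value should be rejected.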

Aeronautics answered 22/9, 2011 at 21:48 Comment(2)
it's not too hard to add validation on top of stax. implemented that myself the other day.Chellean
More detail on validation: https://mcmap.net/q/243366/-stax-xml-validationBetony
V
86

Overview
XML documents are hierarchical documents, where the same element names and namespaces might occur in several places, with different meanings, and at arbitrary depth (recursively). As usual, the solution to big problems is to divide them into small problems. In the context of XML parsing, this means parsing specific parts of the XML in methods specific to those parts. For example, one piece of logic would parse an address:

<Address>
    <Street>Odins vei</Street>    
    <Building>4</Building>
    <Door>b</Door>
</Address>

i.e. you would have a method

AddressType parseAddress(...); // A

or

void parseAddress(...); // B

somewhere in your logic, taking XML input as arguments and returning an object (the result of B can be fetched from a field later).

SAX
SAX 'pushes' XML events, leaving it up to you to determine where the XML events belong in your program / data.

// method in stock SAX handler
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    // .. your logic here for start element
}

In the case of a 'Building' start element, you would need to determine that you are actually parsing an Address and then route the XML event to the method whose job it is to interpret Address.
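
A minimal sketch of that routing, assuming the Address document from the overview. The class name SaxAddressSketch, its fields, and the stack of open elements are illustrative choices, not a prescribed way of writing a handler:

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxAddressSketch extends DefaultHandler {

    // Explicit parse state: the open elements (innermost first) and the current text.
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();

    // Results are fetched from fields afterwards ("method B").
    String street, building, door;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        path.push(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per text node
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        path.pop();
        // Routing step: only treat the element as address data if its parent is <Address>.
        if ("Address".equals(path.peek())) {
            String value = text.toString().trim();
            switch (qName) {
                case "Street": street = value; break;
                case "Building": building = value; break;
                case "Door": door = value; break;
                default: break; // unknown child, ignored in this sketch
            }
        }
    }

    /** Convenience entry point; wraps checked exceptions to keep the sketch short. */
    static SaxAddressSketch parse(String xml) {
        try {
            SaxAddressSketch handler = new SaxAddressSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Note that a 'Building' element outside of an Address is deliberately not picked up; that check is exactly the state handling SAX leaves to you.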

StAX
StAX 'pulls' XML events, leaving it up to you to determine where in your program / data to receive the XML events.

// method in standard StAX reader
int event = reader.next();
if(event == XMLStreamConstants.START_ELEMENT) {
    // .. your logic here for start element
}

Of course, you would always want to receive a 'Building' event in the method whose job it is to interpret Address.
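
As a rough sketch of "method A" for the same Address document: the names StaxAddressSketch and Address are hypothetical, and the code assumes that every child of Address is a simple text-only element.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxAddressSketch {

    /** Hypothetical value type standing in for AddressType. */
    static final class Address {
        String street, building, door;
    }

    /** Expects the reader on the Address start tag; returns on its end tag. */
    static Address parseAddress(XMLStreamReader reader) throws XMLStreamException {
        Address address = new Address();
        // nextTag() skips whitespace and stops at the next start or end tag.
        while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            String name = reader.getLocalName();
            String value = reader.getElementText(); // leaves the reader on the child's end tag
            switch (name) {
                case "Street": address.street = value; break;
                case "Building": address.building = value; break;
                case "Door": address.door = value; break;
                default: break; // unknown text-only child, ignored in this sketch
            }
        }
        return address;
    }

    /** Convenience entry point; wraps checked exceptions to keep the sketch short. */
    static Address parse(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            try {
                reader.nextTag(); // move to the root start tag
                return parseAddress(reader);
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Here the 'Building' event can only ever be seen inside parseAddress, so no separate routing state is needed.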

Discussion
The difference between SAX and StAX is that of push and pull. In both cases, the parse state must be handled somehow.

This translates to method B being typical for SAX, and method A for StAX. In addition, SAX must give B individual XML events, while StAX can give A multiple events (by passing an XMLStreamReader instance).

Thus B must first check the previous state of the parsing, then handle each individual XML event, and then store the state (in a field). Method A can just handle the XML events all at once by accessing the XMLStreamReader multiple times until satisfied.

Conclusion
StAX lets you structure your parsing (data-binding) code according to the XML structure; so in relation to SAX, the 'state' is implicit from the program flow for StAX, whereas in SAX, you always need to preserve some kind of state variable and route the flow according to that state for most event calls.

I recommend StAX for all but the simplest documents. If needed, move to SAX as an optimization later (but you'll probably want to go binary by then).

Follow this pattern when parsing using StAX:

public MyDataBindingObject parse(..) { // provide input stream, reader, etc

        // set up parser
        // read the root tag to get to level 1
        XMLStreamReader reader = ....;

        do {
            int event = reader.next();
            if(event == XMLStreamConstants.START_ELEMENT) {
              // check if correct root tag
              break;
            }

            // add check for document end if you want to

        } while(reader.hasNext());

        MyDataBindingObject object = new MyDataBindingObject();
        // read root attributes if any

        int level = 1; // we are at level 1, since we have read the root start tag

        do {
            int event = reader.next();
            if(event == XMLStreamConstants.START_ELEMENT) {
                level++;
                // do stateful stuff here

                // for child logic:
                if(reader.getLocalName().equals("Whatever1")) {
                    WhateverObject child = parseSubTreeForWhatever(reader);
                    level --; // read from level 1 to 0 in submethod.

                    // do something with the result of subtree
                    object.setWhatever(child);
                }

                // alternatively, faster: dispatch on depth instead of element name
                if(level == 2) {
                    WhateverObject child = parseSubTreeForWhateverAtRelativeLevel2(reader);
                    level --; // read from level 1 to 0 in submethod.

                    // do something with the result of subtree
                    object.setWhatever(child);
                }


            } else if(event == XMLStreamConstants.END_ELEMENT) {
                level--;
                // do stateful stuff here, too
            }

        } while(level > 0);

        return object;
}

So the submethod uses about the same approach, i.e. counting level:

private MySubTreeObject parseSubTree(XMLStreamReader reader) throws XMLStreamException {

    MySubTreeObject object = new MySubTreeObject();
    // read element attributes if any

    int level = 1;
    do {
        int event = reader.next();
        if(event == XMLStreamConstants.START_ELEMENT) {
            level++;
            // do stateful stuff here

            // for child logic:
            if(reader.getLocalName().equals("Whatever2")) {
                MyWhateverObject child = parseMySubelementTree(reader);
                level --; // read from level 1 to 0 in submethod.

                // use subtree object somehow
                object.setWhatever(child);
            }

            // alternatively, faster, but less strict
            if(level == 2) {
              MyWhateverObject child = parseMySubelementTree(reader);
                level --; // read from level 1 to 0 in submethod.

                // use subtree object somehow
                object.setWhatever(child);
            }


        } else if(event == XMLStreamConstants.END_ELEMENT) {
            level--;
            // do stateful stuff here, too
        }

    } while(level > 0);

    return object;
}

And then eventually you reach a level in which you will read the base types.

private MySetterGetterObject parseSubTree(XMLStreamReader reader) throws XMLStreamException {

    MySetterGetterObject myObject = new MySetterGetterObject();
    // read element attributes if any

    int level = 1;
    do {
        int event = reader.next();
        if(event == XMLStreamConstants.START_ELEMENT) {
            level++;

            // assume <FirstName>Thomas</FirstName>:
            if(reader.getLocalName().equals("FirstName")) {
               // read tag contents
               String text = reader.getElementText();
               if(text.length() > 0) {
                    myObject.setName(text);
               }
               level--;

            } else if(reader.getLocalName().equals("LastName")) {
               // etc ..
            } 


        } else if(event == XMLStreamConstants.END_ELEMENT) {
            level--;
            // do stateful stuff here, too
        }

    } while(level > 0);

    // verify that all required fields in myObject are present

    return myObject;
}

This is quite straightforward and there is no room for misunderstandings. Just remember to decrement level correctly:

A. after you expected characters but got an END_ELEMENT in some tag which should contain chars (in the above pattern):

<Name>Thomas</Name>

was instead

<Name></Name>

The same is true for a missing subtree too, you get the idea.

B. after calling subparsing methods, which are called on start elements and return AFTER the corresponding end element, i.e. the parser is one level lower than before the method call (the above pattern).

Note how this approach also ignores 'ignorable' whitespace entirely, which makes for a more robust implementation.
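
Regarding case A, here is a small sketch of what XMLStreamReader.getElementText() does for a filled and an empty element. The class name ElementTextSketch and the describe method are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class ElementTextSketch {

    /** Reads the text of the root element and reports where the reader ends up. */
    static String describe(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            try {
                reader.nextTag(); // move to the root start tag
                String text = reader.getElementText();
                // getElementText() has consumed everything up to the matching end tag,
                // which is why the pattern above decrements level right after calling it.
                boolean onEndTag = reader.getEventType() == XMLStreamConstants.END_ELEMENT;
                return "text='" + text + "', onEndTag=" + onEndTag;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

For both <Name>Thomas</Name> and <Name></Name> the reader ends on the end tag; the empty element simply yields an empty string, so the level bookkeeping stays the same.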

Parsers
Go with Woodstox for most features or Aalto-xml for speed.

Voncile answered 22/9, 2011 at 21:36 Comment(4)
In your opening statement it reads "...whereas in SAX...". Is this a typo? ("SAX" instead of "StAX") In any case thanks for the answer. If I understand you correctly, you're saying that the implicit state in the SAX approach is a benefit compared to the need for tracking your xml-tree location in the StAX approach.Clausen
Thanks for the (now even more elaborate) answer. I'm afraid I still don't see what would be a good reason for using SAX instead of StAX. Your answer is a good explanation of how both processors work.Clausen
For simple documents, they are the same. Look at for example this schema: mpeg.chiariglione.org/technologies/mpeg-21/mp21-did/index.htm and StAX will be more practical.Voncile
In a nutshell, since you are already writing your code, you understand what part of the document you are parsing, i.e. all the logic needed to map a SAX event to its correct code is wasted.Voncile
C
17

@Rinke: The only case I can think of for preferring SAX over StAX is when you don't need to handle or process the XML content, e.g. when all you want to do is check incoming XML for well-formedness and handle errors if there are any. In that case you can simply call the parse() method on the SAX parser and specify an error handler to handle any parsing problem. So StAX is definitely the preferable choice in scenarios where you want to handle content, because a SAX content handler is too difficult to code.

One practical example of this case: if you have a series of SOAP nodes in your enterprise system, and an entry-level SOAP node only lets through to the next stage those SOAP XML messages that are well-formed, then I don't see any reason why I would use StAX. I would just use SAX.
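
A minimal sketch of such a well-formedness check. The class name WellFormedCheckSketch is hypothetical, and the stock DefaultHandler doubles as the error handler here, since its fatalError method rethrows the parse error:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheckSketch {

    /** Returns true if the input parses without a fatal error, i.e. is well-formed. */
    static boolean isWellFormed(String xml) {
        try {
            // No content handling at all: every content event is ignored by DefaultHandler.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // thrown by DefaultHandler.fatalError on malformed input
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("could not run the parser", e);
        }
    }
}
```

In a real entry node you would probably override fatalError to log or report the location of the problem instead of just returning false.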

Cockscomb answered 6/10, 2011 at 7:41 Comment(1)
I selected this answer as the best one so far. Although it's a good answer, I don't feel it's 100% authoritative and clear, however. New answers are welcome.Clausen
C
1

It's all a balance.

You can turn a SAX parser into a pull parser using a blocking queue and some thread trickery so, to me, there is much less difference than there first seems.
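
Roughly, the idea looks like the sketch below. It is not the code I actually used; the class name SaxPullSketch, the string-encoded events, and the queue size are assumptions made to keep the example short:

```java
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxPullSketch {

    private static final String END = "<end-of-document>"; // sentinel, never a real event

    private final BlockingQueue<String> events = new ArrayBlockingQueue<>(16);
    private volatile Exception failure;
    private boolean done;

    SaxPullSketch(String xml) {
        // The SAX parser pushes events on its own thread; the caller pulls them via next().
        Thread producer = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName, String qName, Attributes attributes) {
                            push("start:" + qName);
                        }

                        @Override
                        public void endElement(String uri, String localName, String qName) {
                            push("end:" + qName);
                        }
                    });
            } catch (Exception e) {
                failure = e;
            } finally {
                push(END);
            }
        });
        producer.setDaemon(true); // an abandoned parse must not keep the JVM alive
        producer.start();
    }

    private void push(String event) {
        try {
            events.put(event); // blocks while the consumer is behind
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Blocks until the next event is available; returns null at end of document. */
    String next() {
        if (done) {
            return null;
        }
        try {
            String event = events.take();
            if (!END.equals(event)) {
                return event;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        done = true;
        if (failure != null) {
            throw new IllegalStateException("SAX parse failed", failure);
        }
        return null;
    }
}
```

The caller then simply loops on next() until it returns null, much like with a StAX reader, at the cost of an extra thread per parse.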

I believe currently StAX needs to be packaged through a third-party jar while SAX comes free in javax.

I recently chose SAX and built a pull parser around it so I did not need to rely on a third-party jar.

Future versions of Java will almost certainly contain a StAX implementation so the problem goes away.

Carbonation answered 10/10, 2011 at 13:49 Comment(1)
Java SE 6 does include StAX. But e.g. android implementation does not include it.Polygnotus
B
0

StAX enables you to create bidirectional XML parsers that are fast. It proves to be a better alternative to other methods, such as DOM and SAX, both in terms of performance and usability.
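
Assuming "bidirectional" here refers to StAX covering both reading and writing, a small round-trip sketch could look like this. The class name StaxRoundTripSketch and the Address/Street element names are only illustrative:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

final class StaxRoundTripSketch {

    /** Writes a tiny Address document with the StAX cursor writer. */
    static String write(String street) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartDocument();
            writer.writeStartElement("Address");
            writer.writeStartElement("Street");
            writer.writeCharacters(street); // special characters such as & are escaped
            writer.writeEndElement();
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("write failed", e);
        }
    }

    /** Reads the Street value back with the StAX cursor reader. */
    static String readStreet(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            try {
                reader.nextTag(); // Address start tag
                reader.nextTag(); // Street start tag
                return reader.getElementText();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("read failed", e);
        }
    }
}
```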

You can read more about StAX in Java StAX Tutorials

Bellabelladonna answered 1/4, 2015 at 9:59 Comment(0)
P
-2

Most of the information provided by these answers is somewhat outdated... there has been a comprehensive study of all XML parsing libs in this 2013 research paper... read it and you will easily see the clear winner (hint: there is only one true winner)...

http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf

Padova answered 19/4, 2016 at 20:26 Comment(6)
I read the paper, the winner is StAX using the cursor API as in XMLStreamReader.Dm
very funny :), you mean the winner of tortoise race :)Padova
I just reread the paper, and yes, StAX is superior to VTD: faster and with less memory consumption. So what is your point?Dm
the winner is stAX in what way? which part of the paper are you referring to? modifying document, or selecting or differentiation? apparently the author of the paper drew a different conclusion. but they could be totally wrong...Padova
e.g. page 80: According to results (figure 11 and figure 12) we can see that StAX is the API that has the better performance, followed by VTD. However, VTD consumes a considerable amount of memory. Memory consumption can be a bottleneck for environments that provide limited capabilities.Dm
Some operations are faster in VTD, e.g. difference operation. So if you need that consider using VTDDm
