Using SAX to parse common XML elements
I'm currently using SAX (Java) to parse a handful of different XML documents, each representing different data and having a slightly different structure. For this reason, each XML document is handled by a different SAX handler class (subclassing DefaultHandler).

However, there are some XML structures that can appear in all these different documents. Ideally, I'd like to tell the parser "Hey, when you reach a complex_node element, just use ComplexNodeHandler to read it, and give me back the result. If you reach a some_other_node, use OtherNodeHandler to read it and give me back that result".

However, I can't see an obvious way to do this.

Should I just write a single monolithic handler class that can read all of my documents (which would at least eliminate the duplicated code), or is there a smarter way to handle this?

Connors answered 4/8, 2010 at 12:57 Comment(3)
I'm hoping/sure I've just missed some painfully obvious solution! – Connors
Is SAX a requirement? How about using XPath with DOM, XOM or VTD-XML? – Otology
Because SAX is the fastest and uses the least memory, which is important on mobile devices (which I neglected to mention when I asked this question originally). – Connors

Below is an answer I gave to a similar question (Skipping nodes with SAX). It demonstrates how to swap content handlers on an XMLReader in the middle of a parse.

In this example the swapped-in ContentHandler simply ignores all events until it gives up control, but you could easily adapt the concept so that the swapped-in handler processes the events instead.

You could do something like the following:

import javax.xml.parsers.SAXParser; 
import javax.xml.parsers.SAXParserFactory; 
import org.xml.sax.XMLReader; 

public class Demo { 

    public static void main(String[] args) throws Exception { 
        SAXParserFactory spf = SAXParserFactory.newInstance(); 
        SAXParser sp = spf.newSAXParser(); 
        XMLReader xr = sp.getXMLReader(); 
        xr.setContentHandler(new MyContentHandler(xr)); 
        xr.parse("input.xml"); 
    } 
} 

MyContentHandler

This class is responsible for processing your XML document. When you hit a node you want to ignore (the "sodium" element in this example), you can swap in the IgnoringContentHandler, which will swallow all events for that node.

import org.xml.sax.Attributes; 
import org.xml.sax.ContentHandler; 
import org.xml.sax.Locator; 
import org.xml.sax.SAXException; 
import org.xml.sax.XMLReader; 

public class MyContentHandler implements ContentHandler { 

    private XMLReader xmlReader; 

    public MyContentHandler(XMLReader xmlReader) { 
        this.xmlReader = xmlReader; 
    } 

    public void setDocumentLocator(Locator locator) { 
    } 

    public void startDocument() throws SAXException { 
    } 

    public void endDocument() throws SAXException { 
    } 

    public void startPrefixMapping(String prefix, String uri) 
            throws SAXException { 
    } 

    public void endPrefixMapping(String prefix) throws SAXException { 
    } 

    public void startElement(String uri, String localName, String qName, 
            Attributes atts) throws SAXException { 
        if("sodium".equals(qName)) { 
            xmlReader.setContentHandler(new IgnoringContentHandler(xmlReader, this)); 
        } else { 
            System.out.println("START " + qName); 
        } 
    } 

    public void endElement(String uri, String localName, String qName) 
            throws SAXException { 
        System.out.println("END " + qName); 
    } 

    public void characters(char[] ch, int start, int length) 
            throws SAXException { 
        System.out.println(new String(ch, start, length)); 
    } 

    public void ignorableWhitespace(char[] ch, int start, int length) 
            throws SAXException { 
    } 

    public void processingInstruction(String target, String data) 
            throws SAXException { 
    } 

    public void skippedEntity(String name) throws SAXException { 
    } 

} 

IgnoringContentHandler

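When the IgnoringContentHandler is done swallowing events, it passes control back to your main ContentHandler. It tracks the nesting depth so that it hands control back only at the end tag that matches the element which triggered the swap.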

import org.xml.sax.Attributes; 
import org.xml.sax.ContentHandler; 
import org.xml.sax.Locator; 
import org.xml.sax.SAXException; 
import org.xml.sax.XMLReader; 

public class IgnoringContentHandler implements ContentHandler { 

    private int depth = 1; 
    private XMLReader xmlReader; 
    private ContentHandler contentHandler; 

    public IgnoringContentHandler(XMLReader xmlReader, ContentHandler contentHandler) { 
        this.contentHandler = contentHandler; 
        this.xmlReader = xmlReader; 
    } 

    public void setDocumentLocator(Locator locator) { 
    } 

    public void startDocument() throws SAXException { 
    } 

    public void endDocument() throws SAXException { 
    } 

    public void startPrefixMapping(String prefix, String uri) 
            throws SAXException { 
    } 

    public void endPrefixMapping(String prefix) throws SAXException { 
    } 

    public void startElement(String uri, String localName, String qName, 
            Attributes atts) throws SAXException { 
        depth++; 
    } 

    public void endElement(String uri, String localName, String qName) 
            throws SAXException { 
        depth--; 
        if(0 == depth) { 
           xmlReader.setContentHandler(contentHandler); 
        } 
    } 

    public void characters(char[] ch, int start, int length) 
            throws SAXException { 
    } 

    public void ignorableWhitespace(char[] ch, int start, int length) 
            throws SAXException { 
    } 

    public void processingInstruction(String target, String data) 
            throws SAXException { 
    } 

    public void skippedEntity(String name) throws SAXException { 
    } 

} 
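
To adapt this so that the swapped-in handler builds a result instead of ignoring events, one option is to pass it a callback that it invokes just before handing control back. What follows is only a minimal sketch under stated assumptions: DelegatingDemo, DocumentHandler, the onDone callback and the idea of treating the collected text of the subtree as "the result" are hypothetical and chosen for illustration; complex_node and ComplexNodeHandler are the names used in the question. It also assumes Java 8 or later for java.util.function.Consumer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class DelegatingDemo {

    /** Reusable handler for one complex_node subtree (illustrative only). */
    static class ComplexNodeHandler extends DefaultHandler {
        private final XMLReader reader;
        private final ContentHandler parent;
        private final Consumer<String> onDone; // hypothetical callback that receives the result
        private final StringBuilder text = new StringBuilder();
        private int depth = 1; // the parent has already consumed the opening tag

        ComplexNodeHandler(XMLReader reader, ContentHandler parent, Consumer<String> onDone) {
            this.reader = reader;
            this.parent = parent;
            this.onDone = onDone;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            depth++;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (--depth == 0) {
                onDone.accept(text.toString().trim()); // give the result back
                reader.setContentHandler(parent);      // then return control
            }
        }
    }

    /** Document-specific handler that delegates complex_node to the shared handler. */
    static class DocumentHandler extends DefaultHandler {
        private final XMLReader reader;
        final List<String> results = new ArrayList<>();

        DocumentHandler(XMLReader reader) {
            this.reader = reader;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("complex_node".equals(qName)) {
                reader.setContentHandler(new ComplexNodeHandler(reader, this, results::add));
            }
        }
    }

    public static List<String> parse(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DocumentHandler handler = new DocumentHandler(reader);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<doc><complex_node><a>1</a><b>2</b></complex_node></doc>"));
    }
}
```

Because the callback is supplied by whichever handler does the swap, the same ComplexNodeHandler could be reused by every document-specific handler, and each one decides for itself what to do with the result.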
Spiral answered 4/8, 2010 at 19:48 Comment(8)
Hmm, didn't realise XMLReader could be changed on-the-fly in that way. Definitely seems like the neatest way to handle it. – Connors
XMLReader was designed to do just that; refer to download-llnw.oracle.com/javase/6/docs/api/org/xml/sax/… . We make use of this in MOXy, our JAXB implementation: when doing SAX processing, we have a ContentHandler per object being built. – Spiral
@Blaise Doughan First of all, thank you for this solution; it is exactly what I've been looking for. I have a question, though. Is there any special thought behind evaluating the depth of the structure to know when to pass back to the main content handler? Is there any problem in using the endDocument() method for this purpose? – Storehouse
@Octavian Damiean, the parser will only call endDocument once. This is why the depth variable is needed. – Spiral
@Blaise Doughan, Ah, perhaps for my purpose it's OK that way because I have only standalone XMLs. So I'm parsing from the beginning till the end, but just branched for different standalone XMLs. I guess you need the depth variable if you want to skip parts of an XML based on a certain tag. Thanks for the answer. – Storehouse
@Octavian Damiean this solution is for switching content handlers during a single parse operation. When you switch in a content handler you need to keep track of when to swap it back out. – Spiral
@Blaise Doughan, Yes, I see now. That's what I meant. My problem is not a single parse operation but multiple different ones with differing structure. But this solution works for me too. I just don't have to swap back out. :) – Storehouse
@OctavianDamiean This seems to be exactly what I'm looking for. But once you swap handlers to one that doesn't ignore but parses into an object, how does one get that parsed object back? I can't picture it. – Lisalisabet

You could have one handler (ComplexNodeHandler) that handles only some parts of a document (complex_node) and passes all other pieces to another handler. The constructor for ComplexNodeHandler would take the other handler as a parameter. I mean something like this:

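import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ComplexNodeHandler extends DefaultHandler {

    private final ContentHandler handlerForOtherNodes;
    // Nesting depth inside complex_node; 0 means we are outside it
    private int complexDepth = 0;

    public ComplexNodeHandler(ContentHandler handlerForOtherNodes) {
        this.handlerForOtherNodes = handlerForOtherNodes;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (complexDepth > 0 || "complex_node".equals(qName)) {
            complexDepth++;
            // [handle complex node data]
        } else {
            // pass the event to the document specific handler
            handlerForOtherNodes.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (complexDepth > 0) {
            complexDepth--;
        } else {
            handlerForOtherNodes.endElement(uri, localName, qName);
        }
    }

    // ... characters() and the remaining callbacks need the same kind of forwarding

}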

There could be better alternatives still since I'm not that familiar with SAX. Writing a base handler for the common parts and inheriting it could work too but I'm not sure if using inheritance here is a good idea.
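
For comparison, the inheritance variant mentioned above might look roughly like the sketch below. This is an illustration under assumptions rather than a recommendation: InheritanceDemo, CommonPartsHandler, OrderHandler and the hook methods onComplexNode, startOther and endOther are all hypothetical names, and the "result" of a complex_node is simply its collected text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class InheritanceDemo {

    /** Base class that owns the parsing of complex_node for every document type. */
    abstract static class CommonPartsHandler extends DefaultHandler {
        private final StringBuilder complexText = new StringBuilder();
        private int complexDepth = 0; // 0 means we are outside complex_node

        @Override
        public final void startElement(String uri, String localName, String qName, Attributes atts) {
            if (complexDepth > 0 || "complex_node".equals(qName)) {
                complexDepth++;
            } else {
                startOther(qName, atts);
            }
        }

        @Override
        public final void characters(char[] ch, int start, int length) {
            if (complexDepth > 0) {
                complexText.append(ch, start, length);
            }
        }

        @Override
        public final void endElement(String uri, String localName, String qName) {
            if (complexDepth == 0) {
                endOther(qName);
            } else if (--complexDepth == 0) {
                onComplexNode(complexText.toString().trim());
                complexText.setLength(0);
            }
        }

        /** Called once per complex_node with its collected text. */
        protected abstract void onComplexNode(String text);

        /** Hooks for document-specific elements; no-ops by default. */
        protected void startOther(String qName, Attributes atts) { }

        protected void endOther(String qName) { }
    }

    /** One document-specific handler; other document types would add their own subclass. */
    static class OrderHandler extends CommonPartsHandler {
        final List<String> events = new ArrayList<>();

        @Override
        protected void onComplexNode(String text) {
            events.add("complex:" + text);
        }

        @Override
        protected void startOther(String qName, Attributes atts) {
            events.add("start:" + qName);
        }
    }

    public static List<String> parse(String xml) throws Exception {
        OrderHandler handler = new OrderHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<order><id>7</id><complex_node><a>x</a></complex_node></order>"));
    }
}
```

The trade-off is that every shared structure has to live in the one base class, so this scales less gracefully than separate handlers if the number of common structures grows.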

Abra answered 4/8, 2010 at 15:15 Comment(1)
I considered this, but quickly determined it would become rather complex. I'd have to forward calls from not just startElement, but endElement, characters, and the error handlers. – Connors
