Loading local chunks in DOM while parsing a large XML file in SAX (Java)

Asked 3/11, 2011 at 16:51 Answered 9/11, 2011 at 17:6

I have an XML file that I would like to avoid loading entirely into memory. As everyone knows, for such a file I should use a SAX parser (which walks through the file and fires events whenever something relevant is found).

My current problem is that I would like to process the file "by chunk", which means:

1. Parse the file and find a relevant tag (node)
2. Load this tag entirely into memory (as we would do with DOM)
3. Process this entity (the local chunk)
4. When I'm done with the chunk, release it and go back to step 1 (until the end of the file)

In a perfect world, I'm looking for something like this:

// 1. Create a parser and set the file to load
IdealParser p = new IdealParser("BigFile.xml");
// 2. Set an XPath to define the interesting nodes
p.setRelevantNodesPath("/path/to/relevant/nodes");
// 3. Add a handler to call back the right method once a node is found
p.setHandler(new Handler() {
    // 4. The method called back by the parser when a relevant node is found
    void aNodeIsFound(saxNode aNode) {
        // 5. Inflate the current node, i.e. load it (and all its content) in memory
        DomNode d = aNode.expand();
        // 6. Do something with the inflated node (method to be defined somewhere)
        doThingWithNode(d);
    }
});
// 7. Start the parser
p.start();

I'm currently stuck on how to expand a "SAX node" (if you see what I mean…) efficiently.

Is there any Java framework or library relevant to this kind of task?

Licht answered 3/11, 2011 at 16:51 Comment(0)

OK, thanks to your pieces of code, I finally ended up with my own solution.

Usage is quite intuitive:

try {
    /* CREATE THE PARSER */
    XMLParser parser = new XMLParser();
    /* CREATE THE FILTER (THIS IS A REGEX (X)PATH FILTER) */
    XMLRegexFilter filter = new XMLRegexFilter("statements/statement");
    /* CREATE THE HANDLER WHICH WILL BE CALLED WHEN A NODE IS FOUND */
    XMLHandler handler = new XMLHandler() {
        public void nodeFound(StringBuilder node, XMLStackFilter withFilter) {
            // DO SOMETHING WITH THE FOUND XML NODE
            System.out.println("Node found");
            System.out.println(node.toString());
        }
    };
    /* ATTACH THE FILTER WITH THE HANDLER */
    parser.addFilterWithHandler(filter, handler);
    /* SET THE FILE TO PARSE */
    parser.setFilePath("/path/to/bigfile.xml");
    /* RUN THE PARSER */
    parser.parse();
} catch (Exception ex) {
    ex.printStackTrace();
}

Notes:

I made the XMLNodeFoundNotifier and XMLStackFilter interfaces so that you can easily integrate or build your own handler / filter.
Normally you should be able to parse very large files with this class. Only the returned nodes are actually loaded into memory.
You can enable attribute support by uncommenting the relevant part of the code; I disabled it for simplicity.
You can use as many filters per handler as you need, and vice versa.
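
A note on how the regex filter behaves: the path that gets matched is the slash-separated list of tag names with no leading slash, and Matcher.matches() only succeeds when the whole path matches the expression. The sketch below isolates just that matching step; the class name PathRegexSketch and the sample paths are made up for illustration, and it assumes tags without namespaces.

```java
import java.util.regex.Pattern;

class PathRegexSketch {

    // Same test as XMLRegexFilter.isRelevant, applied to an already-built path string.
    static boolean isRelevant(String filterRules, String stackPath) {
        // matches() requires the ENTIRE path to match, unlike find()
        return Pattern.compile(filterRules).matcher(stackPath).matches();
    }

    public static void main(String[] args) {
        // Exact path: only <statement> directly under <statements>
        System.out.println(isRelevant("statements/statement", "statements/statement"));
        // A deeper node is not matched by the exact expression
        System.out.println(isRelevant("statements/statement", "statements/statement/line"));
        // An optional prefix group matches <statement> at any depth
        System.out.println(isRelevant("(.*/)?statement", "archive/2011/statement"));
    }
}
```

So a filter such as "statements/statement" selects only that exact path, while something like "(.*/)?statement" is one way to select the tag wherever it appears.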

All of the code is here:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Stack;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.*;

/* IMPLEMENT THIS IN YOUR CLASS IN ORDER TO BE NOTIFIED WHEN A NODE IS FOUND */
interface XMLNodeFoundNotifier {

    abstract void nodeFound(StringBuilder node, XMLStackFilter withFilter);
}

/* A SMALL HANDLER USEFUL FOR EXPLICIT CLASS DECLARATION */
abstract class XMLHandler implements XMLNodeFoundNotifier {
}

/* INTERFACE TO WRITE YOUR OWN FILTER BASED ON THE CURRENT NODE STACK (PATH) */
interface XMLStackFilter {

    abstract boolean isRelevant(Stack fullPath);
}

/* A VERY USEFUL FILTER USING A REGEX AS THE PATH FILTER */
class XMLRegexFilter implements XMLStackFilter {

    Pattern relevantExpression;

    XMLRegexFilter(String filterRules) {
        relevantExpression = Pattern.compile(filterRules);
    }

    /* HERE WE ARE ASKED TO TELL WHETHER THE CURRENT STACK (LIST OF NODES) IS RELEVANT
     * OR NOT ACCORDING TO WHAT WE WANT. RETURN TRUE IF THIS IS THE CASE */
    @Override
    public boolean isRelevant(Stack fullPath) {
        /* A POSSIBLE CLEVER WAY COULD BE TO SERIALIZE THE WHOLE PATH (INCLUDING
         * ATTRIBUTES) TO A STRING AND TO MATCH IT WITH A REGEX BEING THE FILTER
         * FOR NOW StackToString DOES NOT SERIALIZE ATTRIBUTES */
        String stackPath = XMLParser.StackToString(fullPath);
        Matcher m = relevantExpression.matcher(stackPath);
        return  m.matches();
    }
}

/* THE MAIN PARSER'S CLASS */
public class XMLParser {

    HashMap<XMLStackFilter, XMLNodeFoundNotifier> filterHandler;
    HashMap<Integer, Integer> feedingStreams;
    Stack<HashMap> currentStack;
    String filePath;

    XMLParser() {
        currentStack   = new Stack<HashMap>();
        filterHandler  = new HashMap<XMLStackFilter, XMLNodeFoundNotifier>();
        feedingStreams = new HashMap<Integer, Integer>();
    }

    public void addFilterWithHandler(XMLStackFilter f, XMLNodeFoundNotifier h) {
        filterHandler.put(f, h);
    }

    public void setFilePath(String filePath) {
        this.filePath = filePath;
    }

    /* CONVERT A STACK OF NODES TO A REGULAR PATH STRING. NOTE THAT BY DEFAULT
     * I DID NOT ADD THE ATTRIBUTES TO THE PATH. UNCOMMENT THE LINES BELOW TO
     * DO SO
     */
    public static String StackToString(Stack<HashMap> s) {
        int k = s.size();
        if (k == 0) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        out.append(s.get(0).get("tag"));
        for (int x = 1; x < k; ++x) {
            HashMap node = s.get(x);
            out.append('/').append(node.get("tag"));
            /* 
            // UNCOMMENT THIS TO ADD THE ATTRIBUTES SUPPORT TO THE PATH

            ArrayList <String[]>attributes = (ArrayList)node.get("attr");
            if (attributes.size()>0)
            {
            out.append("[");
            for (int i = 0 ; i<attributes.size(); i++)
            {
            String[]keyValuePair = attributes.get(i);
            if (i>0) out.append(",");
            out.append(keyValuePair[0]);
            out.append("=\"");
            out.append(keyValuePair[1]);
            out.append("\"");
            }
            out.append("]");
            }*/
        }
        return out.toString();
    }

    /*
     * ONCE A NODE HAS BEEN SUCCESSFULLY FOUND, WE GET THE DELIMITERS OF THE FILE
     * WE THEN RETRIEVE THE DATA FROM IT.
     */
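    private StringBuilder getChunk(int from, int to) throws Exception {
        int length = to - from;
        char[] readb = new char[length];
        int total = 0;
        // REOPEN THE FILE AND CLOSE IT WHEN DONE, OTHERWISE EVERY CHUNK LEAKS A FILE HANDLE
        try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
            br.skip(from);
            // read() MAY RETURN FEWER CHARS THAN REQUESTED, SO LOOP UNTIL THE CHUNK IS COMPLETE
            int n;
            while (total < length && (n = br.read(readb, total, length - total)) != -1) {
                total += n;
            }
        }
        StringBuilder b = new StringBuilder();
        b.append(readb, 0, total);
        return b;
    }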
    /* TRANSFORMS AN XSR NODE TO A HASHMAP NODE'S REPRESENTATION */
    public HashMap XSRNode2HashMap(XMLStreamReader xsr) {
        HashMap h = new HashMap();
        ArrayList attributes = new ArrayList();

        for (int i = 0; i < xsr.getAttributeCount(); i++) {
            String[] s = new String[2];
            s[0] = xsr.getAttributeName(i).toString();
            s[1] = xsr.getAttributeValue(i);
            attributes.add(s);
        }

        h.put("tag", xsr.getName());
        h.put("attr", attributes);

        return h;
    }

    public void parse() throws Exception {
        FileReader f         = new FileReader(filePath);
        XMLInputFactory xif  = XMLInputFactory.newInstance();
        XMLStreamReader xsr  = xif.createXMLStreamReader(f);
        Location previousLoc = xsr.getLocation();

        while (xsr.hasNext()) {
            switch (xsr.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    currentStack.add(XSRNode2HashMap(xsr));
                    for (XMLStackFilter filter : filterHandler.keySet()) {
                        if (filter.isRelevant(currentStack)) {
                            feedingStreams.put(currentStack.hashCode(), Integer.valueOf(previousLoc.getCharacterOffset()));
                        }
                    }
                    previousLoc = xsr.getLocation();
                    break;

                case XMLStreamConstants.END_ELEMENT:
                    Integer stream = null;
                    if ((stream = feedingStreams.get(currentStack.hashCode())) != null) {
                        // FIND ALL THE FILTERS RELATED TO THIS FEEDING STREAM AND CALL THEIR HANDLERS.
                        for (XMLStackFilter filter : filterHandler.keySet()) {
                            if (filter.isRelevant(currentStack)) {
                                XMLNodeFoundNotifier h = filterHandler.get(filter);

                                StringBuilder aChunk = getChunk(stream.intValue(), xsr.getLocation().getCharacterOffset());
                                h.nodeFound(aChunk, filter);
                            }
                        }
                        feedingStreams.remove(currentStack.hashCode());
                    }
                    previousLoc = xsr.getLocation();
                    currentStack.pop();
                    break;
                default:
                    break;
            }
        }
    }
}

Licht answered 9/11, 2011 at 17:6 Comment(0)

UPDATE

You could also just use the javax.xml.xpath APIs:

package forum7998733;

import java.io.FileReader;
import javax.xml.xpath.*;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathDemo {

    public static void main(String[] args) throws Exception {
        XPathFactory xpf = XPathFactory.newInstance();
        XPath xpath = xpf.newXPath();
        InputSource xml = new InputSource(new FileReader("BigFile.xml"));
        Node result = (Node) xpath.evaluate("/path/to/relevant/nodes", xml, XPathConstants.NODE);
        System.out.println(result);
    }

}
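
Two things are worth knowing about this variant. XPathConstants.NODE returns only the first matching node, and evaluating an XPath against an InputSource parses the whole document into memory first, so it does not help with a file that is too big for DOM. For files that do fit, here is a minimal sketch that visits every match using XPathConstants.NODESET. The class name XPathAllMatchesSketch and the accounts helper are made up for illustration; the path and the account attribute come from the sample XML used in this answer.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathAllMatchesSketch {

    // Hypothetical helper: collects the "account" attribute of every matching <statement>.
    static List<String> accounts(Reader xml) {
        List<String> found = new ArrayList<String>();
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // NODESET returns all matches; the whole document is parsed before evaluation
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/statements/statement", new InputSource(xml), XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                found.add(((Element) nodes.item(i)).getAttribute("account"));
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
        return found;
    }
}
```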

Below is a sample of how it could be done with StAX.

input.xml

Below is some sample XML:

<statements>
   <statement account="123">
      ...stuff...
   </statement>
   <statement account="456">
      ...stuff...
   </statement>
</statements>

Demo

In this example, a StAX XMLStreamReader is used to find the node that will be converted to a DOM. Here each statement fragment is converted to a DOM, but your navigation algorithm could be more advanced.

package forum7998733;

import java.io.FileReader;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.*;

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("src/forum7998733/input.xml"));
        xsr.nextTag(); // Advance to statements element

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            DOMResult domResult = new DOMResult();
            t.transform(new StAXSource(xsr), domResult);

            DOMSource domSource = new DOMSource(domResult.getNode());
            StreamResult streamResult = new StreamResult(System.out);
            t.transform(domSource, streamResult);
        }
    }

}

Output

<?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="123">
      ...stuff...
   </statement><?xml version="1.0" encoding="UTF-8" standalone="no"?><statement account="456">
      ...stuff...
   </statement>
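
If you want to work with each fragment as a DOM node rather than print it, the DOMResult can be used directly. The following is a minimal sketch under a few assumptions: the class name StatementChunks and the accounts helper are made up for illustration, and the loop checks the reader's current event instead of calling nextTag(), because where the reader is left after transform() may depend on the implementation (especially when there is no whitespace between elements).

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class StatementChunks {

    // Hypothetical helper: loads one <statement> at a time as a DOM and reads its "account" attribute.
    static List<String> accounts(Reader xml) {
        List<String> found = new ArrayList<String>();
        try {
            XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            Transformer t = TransformerFactory.newInstance().newTransformer();
            while (xsr.hasNext()) {
                if (xsr.getEventType() == XMLStreamConstants.START_ELEMENT
                        && "statement".equals(xsr.getLocalName())) {
                    // Only this fragment is built as a DOM; the transform consumes it from the stream
                    DOMResult chunk = new DOMResult();
                    t.transform(new StAXSource(xsr), chunk);
                    Element statement = ((Document) chunk.getNode()).getDocumentElement();
                    found.add(statement.getAttribute("account"));
                    // Do not advance here: re-check whatever event the reader is on now
                } else {
                    xsr.next();
                }
            }
            xsr.close();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }
}
```

In real code you would process each Element inside the loop and let it go out of scope instead of collecting results. When serializing fragments as in the demo, the repeated XML declaration in the output can be suppressed with t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes").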

Noisette answered 3/11, 2011 at 19:3 Comment(6)

Doughan, you are so mean! I was working on a demo class while you posted this! – Chiaroscuro 3/11, 2011 at 19:13

@Chiaroscuro - You'll notice that my demo code looks a lot like the demo code from one of my other similar answers: #5170478 :) – Noisette 3/11, 2011 at 19:23

That's even meaner :D But hey, what kind of programmers would we be without reusing where appropriate. – Chiaroscuro 3/11, 2011 at 19:26

Hi, I just tried your code, but the body of the while loop is never executed. It looks as if the first "nextTag" jumps to the end of the document. To be continued… :-) – Licht 4/11, 2011 at 10:48

@FlavienVolken - I have updated my answer slightly tweaking the demo code to produce output (also added to answer). The sample code should work as given. – Noisette 4/11, 2011 at 13:13

Hi, the XPath approach would be perfect, but unfortunately my file is too big for it. Increasing the JVM memory could be a solution, but I would prefer to avoid this. I then searched for a library able to process an XPath and return the result "one node at a time" and discovered that VTD-XML (vtd-xml.sourceforge.net/) does it. I will check it out and possibly combine your 2 solutions into one; I will post an answer if I can get something working properly… ;-) – Licht 4/11, 2011 at 18:7

It could be done with SAX... But I think the newer StAX (Streaming API for XML) will serve your purpose better. You could create an XMLEventReader and use that to parse your file, detecting which nodes adhere to one of your criteria. For simple path-based selection (not really XPath, but some simple / delimited path) you'd need to maintain a path to your current node by appending an entry to a String on each start tag and cutting off the last entry on each end tag. A boolean flag can suffice to track whether you're currently in "relevant mode" or not.

As you obtain XMLEvents from your reader, you could copy the relevant ones over to an XMLEventWriter that you've created on some suitable placeholder, like a StringWriter or ByteArrayOutputStream. Once you've completed the copying for some XML extract that forms a "subdocument" of what you wish to build a DOM for, simply supply your placeholder to a DocumentBuilder in a suitable form.

The limitation here is that you're not harnessing all the power of the XPath language. If you wish to take stuff like node position into account, you'd have to foresee that in your own path. Perhaps someone knows of a good way of integrating a true XPath implementation into this.

StAX is really nice in that it gives you control over the parsing, rather than using some callback interface through a handler like SAX.

There's yet another alternative: using XSLT. An XSLT stylesheet is the ideal way to filter out only relevant stuff. You could transform your input once to obtain the required fragments and process those. Or run multiple stylesheets over the same input to get the desired extract each time. An even nicer (and more efficient) solution, however, would be the use of extension functions and/or extension elements.

Extension functions can be implemented in a way that's independent from the XSLT processor being used. They're fairly straightforward to use in Java and I know for a fact that you can use them to pass complete XML extracts to a method, because I've done so already. Might take some experimentation, but it's a powerful mechanism. A DOM extract (or node) is probably one of the accepted parameter types for such a method. That'd leave the document building up to the XSLT processor which is even easier.

Extension elements are also very useful, but I think they need to be used in an implementation-specific manner. If you're okay with tying yourself to a specific JAXP setup like Xerces + Xalan, they might be the answer.

When going for XSLT, you'll have all the advantages of a full XPath 1.0 implementation, plus the peace of mind that comes from knowing XSLT is in really good shape in Java. It limits the building of the input tree to those nodes that are needed at any time and is blazing fast because the processors tend to compile stylesheets into Java bytecode rather than interpreting them. It is possible that using compilation instead of interpretation loses the possibility of using extension elements, though. Not certain about that. Extension functions are still possible.
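
To make the plain stylesheet route concrete, here is a minimal sketch using only the standard javax.xml.transform API. The class name XsltFilterSketch, the extract wrapper element, and the statements/statement structure with an account attribute are assumptions made for illustration. Note also that whether the processor avoids building the full input tree depends on the implementation, so measure memory use on a realistic file before relying on this for very large inputs.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltFilterSketch {

    // Stylesheet that copies only the statements whose account attribute is 456.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<extract><xsl:copy-of select='/statements/statement[@account=\"456\"]'/></extract>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Hypothetical helper: runs the stylesheet over the given XML and returns the extract as text.
    static String filter(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The result could just as well be sent to a DOMResult instead of a StreamResult if the extract should end up as a DOM.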

Whatever way you choose, there's so much out there for XML processing in Java that you'll find plenty of help in implementing this, should you have no luck in finding a ready-made solution. That'd be the most obvious thing, of course... No need to reinvent the wheel when someone did the hard work.

Good luck!

EDIT: because I'm actually not feeling depressed for once, here's a demo using the StAX solution I whipped up. It's certainly not the cleanest code, but it'll give you the basic idea:

package staxdom;

import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.Stack;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DOMExtractor {

    private final Set<String> paths;
    private final XMLInputFactory inputFactory;
    private final XMLOutputFactory outputFactory;
    private final DocumentBuilderFactory docBuilderFactory;
    private final Stack<QName> activeStack = new Stack<QName>();

    private boolean active = false;
    private String currentPath = "";

    public DOMExtractor(final Set<String> paths) {

        this.paths = Collections.unmodifiableSet(new HashSet<String>(paths));
        inputFactory = XMLInputFactory.newFactory();
        outputFactory = XMLOutputFactory.newFactory();
        docBuilderFactory = DocumentBuilderFactory.newInstance();

    }

    public void parse(final InputStream input) throws XMLStreamException, ParserConfigurationException, SAXException, IOException {

        final XMLEventReader reader = inputFactory.createXMLEventReader(input);
        XMLEventWriter writer = null;
        StringWriter buffer = null;
        final DocumentBuilder builder = docBuilderFactory.newDocumentBuilder();

        XMLEvent currentEvent = reader.nextEvent();

        do {

            if(active)
                writer.add(currentEvent);

            if(currentEvent.isEndElement()) {

                if(active) {

                    activeStack.pop();

                    if(activeStack.isEmpty()) {
                        writer.flush();
                        writer.close();
                        final Document doc;
                        final StringReader docReader = new StringReader(buffer.toString());
                        try {
                            doc = builder.parse(new InputSource(docReader));
                        } finally {
                            docReader.close();
                        }
                        //TODO: use doc
                        //Next bit is only for demo...
                        outputDoc(doc);
                        active = false;
                        writer = null;
                        buffer = null;
                    }

                }

                int index;
                if((index = currentPath.lastIndexOf('/')) >= 0)
                    currentPath = currentPath.substring(0, index);

            } else if(currentEvent.isStartElement()) {

                final StartElement start = (StartElement)currentEvent;
                final QName qName = start.getName();
                final String local = qName.getLocalPart();

                currentPath += "/" + local;

                if(!active && paths.contains(currentPath)) {

                    active = true;

                    buffer = new StringWriter();
                    writer = outputFactory.createXMLEventWriter(buffer);

                    writer.add(currentEvent);

                }

                if(active)
                    activeStack.push(qName);

            }

            currentEvent = reader.nextEvent();

        } while(!currentEvent.isEndDocument());

    }

    private void outputDoc(final Document doc) {


        try {
            final Transformer t = TransformerFactory.newInstance().newTransformer();
            t.transform(new DOMSource(doc), new StreamResult(System.out));
            System.out.println("");
            System.out.println("");
        } catch(TransformerException ex) {
            ex.printStackTrace();
        }

    }

    public static void main(String[] args) {

        final Set<String> paths = new HashSet<String>();
        paths.add("/root/one");
        paths.add("/root/three/embedded");

        final DOMExtractor me = new DOMExtractor(paths);

        InputStream stream = null;
        try {
            stream = DOMExtractor.class.getResourceAsStream("sample.xml");
            me.parse(stream);
        } catch(final Exception e) {
            e.printStackTrace();
        } finally {
            if(stream != null)
                try {
                    stream.close();
                } catch(IOException ex) {
                    ex.printStackTrace();
                }
        }

    }

}

And the sample.xml file (should be in the same package):

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <one>
        <two>this is text</two>
        look, I can even handle mixed!
    </one>
    ... not sure what to do with this, though
    <two>
        <willbeignored/>
    </two>
    <three>
        <embedded>
            <and><here><we><go>
                Creative Commons Legal Code

                Attribution 3.0 Unported

                    CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
                    LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN
                    ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
                    INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
                    REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR
                    DAMAGES RESULTING FROM ITS USE.

                License

                THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE
                COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY
                COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS
                AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.

                BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE
                TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY
                BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS
                CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND
                CONDITIONS.
            </go></we></here></and>
        </embedded>
    </three>
</root>

EDIT 2: Just noticed in Blaise Doughan's answer that there's a StAXSource. That'll be even more efficient. Use that if you're going with StAX; it eliminates the need to keep a buffer. StAX allows you to "peek" at the next event, so you can check whether it's a start element with the right path without consuming it before passing it to the transformer.

Chiaroscuro answered 3/11, 2011 at 18:11 Comment(0)

It's been a little while since I did SAX, but what you want to do is process each of the tags until you find the end tag for the group you want to process, then run your process, clear it out and look for the next start tag.
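
A minimal sketch of that idea, assuming a non-namespace-aware parser and a single target tag name: a SAX handler starts a small DOM when the target start tag appears, appends everything up to the matching end tag, hands the finished element to a callback, then drops it. The class name SaxChunkSketch and the parse helper are made up for illustration.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxChunkSketch extends DefaultHandler {
    private final String target;             // tag name that starts a chunk
    private final Consumer<Element> onChunk; // called once per completed chunk
    private Document doc;                    // DOM of the chunk being built, or null
    private Node current;                    // insertion point inside that DOM

    SaxChunkSketch(String target, Consumer<Element> onChunk) {
        this.target = target;
        this.onChunk = onChunk;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (current == null) {
            if (!qName.equals(target)) return; // outside any chunk: ignore
            try {
                doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            } catch (ParserConfigurationException e) {
                throw new SAXException(e);
            }
            current = doc;
        }
        Element e = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(e);
        current = e;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) return;
        current = current.getParentNode();
        if (current == doc) { // the chunk's root element just closed
            onChunk.accept(doc.getDocumentElement());
            doc = null;       // release the chunk
            current = null;
        }
    }

    // Hypothetical convenience entry point; a real caller would pass a file or stream.
    static void parse(String xml, String target, Consumer<Element> onChunk) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new SaxChunkSketch(target, onChunk));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Only one chunk is held in memory at a time, as long as the callback does not keep references to the elements it receives.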

Nkrumah answered 3/11, 2011 at 17:23 Comment(1)

Yep, the question was more like "Is there something which does it for me?". I added the answer which fulfilled my needs :-) Thanks – Licht 9/11, 2011 at 17:7
