Reading a big XML file using StAX and DOM
I need to read several big (200 MB to 500 MB) XML files, so I want to use StAX. My system has two modules: one reads the file (with StAX); the other (the 'parser' module) is supposed to get a single entry of that XML and parse it using DOM. My XML files don't have a fixed structure, so I cannot use JAXB. How can I pass the 'parser' module a specific entry that I want it to parse? For example:

<Items>
   <Item>
        <name> .... </name>
        <price> ... </price>
   </Item>
   <Item>
        <name> .... </name>
        <price> ... </price>
   </Item>
</Items>

I want to use StAX to parse that file, but each 'Item' entry should be passed to the 'parser' module.

Edit:
After a little more reading, I think I need a library that reads an XML file as a stream but parses each entry using DOM. Is there such a thing?

Daynadays answered 21/2, 2012 at 15:7 Comment(0)
F
20

You could use a StAX (javax.xml.stream) parser and transform (javax.xml.transform) each section to a DOM node (org.w3c.dom):

import java.io.*;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.dom.DOMResult;
import org.w3c.dom.*;

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
        xsr.nextTag(); // Advance to the root element (Items)

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            DOMResult result = new DOMResult();
            t.transform(new StAXSource(xsr), result);
            Node domNode = result.getNode();
        }
    }

}
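
The comments on this answer report two things that can go wrong: a third-party TransformerFactory that rejects StAXSource, and a DOMResult that ends up without a node. It is also worth checking where the transformer leaves the cursor. Depending on the implementation it may stop on the entry's end tag or one event past it, and in the second case a following nextTag() could skip an entry when there is no whitespace between entries. Below is a minimal sketch of a more defensive loop, not a definitive implementation. The names ItemStreamer and forEachEntry are hypothetical, and the sketch assumes the default JDK StAX and transformer implementations are the ones picked up.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration; not part of any library.
final class ItemStreamer {

    /** Streams the XML and hands every element named entryName to the handler as a DOM Element. */
    static void forEachEntry(Reader in, String entryName, Consumer<Element> handler)
            throws XMLStreamException, ParserConfigurationException, TransformerException {
        XMLStreamReader xsr = XMLInputFactory.newInstance().createXMLStreamReader(in);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Identity transformer: copies its input to its output unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        try {
            while (xsr.hasNext()) {
                boolean atEntry = xsr.getEventType() == XMLStreamConstants.START_ELEMENT
                        && entryName.equals(xsr.getLocalName());
                if (atEntry) {
                    // Give DOMResult an explicit Document so there is always a node to read back.
                    Document holder = builder.newDocument();
                    identity.transform(new StAXSource(xsr), new DOMResult(holder));
                    handler.accept(holder.getDocumentElement());
                    // Do not advance here: the transformer may already have moved the
                    // cursor past the end tag, so the current event is re-checked instead.
                } else {
                    xsr.next();
                }
            }
        } finally {
            xsr.close();
        }
    }
}
```

A call could look like ItemStreamer.forEachEntry(new FileReader("input.xml"), "Item", item -> parser.parse(item)), where parser stands for the question's 'parser' module and is likewise hypothetical. Only one entry's DOM tree is alive at a time, so memory use should stay proportional to the largest entry rather than to the file.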


Fought answered 21/2, 2012 at 17:27 Comment(4)
Thanks, it works great for me! I used it and it helped me a lot! - Daynadays
For me, in Java 8, the t.transform() line throws a TransformerException: javax.xml.transform.TransformerException: Can't transform a Source of type javax.xml.transform.stax.StAXSource. - Gunboat
I had Apache Xalan as a dependency, and it was providing its own TransformerFactory. One way to work around the problem was to specify the TransformerFactory class explicitly: TransformerFactory transformerFactory = TransformerFactory.newInstance( "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl", null ); - Gunboat
This code will result in null when calling result.getNode(). This is because DOMResult does not create a Node by itself. Instead, you have to provide one yourself, preferably a Document, e.g., result.setNode(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());. - Flier
P
2

Blaise Doughan's answer fails on a clean Java 7 or Java 8 installation due to https://bugs.openjdk.java.net/browse/JDK-8016914

java.lang.NullPointerException
at com.sun.org.apache.xerces.internal.dom.CoreDocumentImpl.setXmlVersion(CoreDocumentImpl.java:860)
at com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM.setDocumentInfo(SAX2DOM.java:144)

Funny thing: if you use a JAXB Unmarshaller, you don't get the NPE:

package com.common.config;

import java.io.*;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.*;

import org.w3c.dom.*;

public class Demo {


    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
        // Advance to root element
        xsr.nextTag(); // TODO: nextTag() can't skip DTD
        xsr.next(); // Advance to first item or EOD

        final JAXBContext jaxbContext = JAXBContext.newInstance();
        final Unmarshaller unm = jaxbContext.createUnmarshaller();
        while(true) {
            // The previous unmarshal() call has already advanced to the next element or whitespace
            if (xsr.getEventType() == XMLStreamReader.START_ELEMENT) {
                JAXBElement<Object> jel = unm.unmarshal(xsr, Object.class);
                Node domNode = (Node)jel.getValue();
                System.err.println(domNode.getNodeName());
            } else if (!xsr.hasNext()) {
                break;
            } else {
                xsr.next();
            }
        }
    }

}

The reason is that com.sun.xml.internal.bind.v2.runtime.unmarshaller.StAXConnector$1 does not implement Locator2, so it has no getXMLVersion() and the failing setXmlVersion() call is never reached.
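
If neither the transformer nor JAXB is usable in a given setup, another option is to copy the current element from the StAX cursor into DOM by hand. This is a different technique from the two shown above, and the following is only a sketch under stated assumptions: StaxToDom and copyElement are hypothetical names, the reader is assumed to be namespace-aware (the default), and comments and processing instructions are simply dropped.

```java
import javax.xml.XMLConstants;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper for illustration; not part of any library.
final class StaxToDom {

    /**
     * Expects the reader on a START_ELEMENT. Returns that element as a detached DOM
     * subtree owned by doc and leaves the reader on the matching END_ELEMENT.
     */
    static Element copyElement(XMLStreamReader xsr, Document doc) throws XMLStreamException {
        if (xsr.getEventType() != XMLStreamConstants.START_ELEMENT) {
            throw new IllegalStateException("Reader is not positioned on a start tag");
        }
        Element root = createElement(xsr, doc);
        Node current = root;
        int depth = 1; // number of currently open elements, including root
        while (depth > 0) {
            switch (xsr.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    Element child = createElement(xsr, doc);
                    current.appendChild(child);
                    current = child;
                    depth++;
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    current = current.getParentNode();
                    depth--;
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.SPACE:
                    current.appendChild(doc.createTextNode(xsr.getText()));
                    break;
                case XMLStreamConstants.CDATA:
                    current.appendChild(doc.createCDATASection(xsr.getText()));
                    break;
                default:
                    break; // comments and processing instructions are dropped in this sketch
            }
        }
        return root;
    }

    // Builds a DOM element (with namespace declarations and attributes) from the current start tag.
    private static Element createElement(XMLStreamReader xsr, Document doc) {
        Element e = doc.createElementNS(emptyToNull(xsr.getNamespaceURI()),
                qName(xsr.getPrefix(), xsr.getLocalName()));
        for (int i = 0; i < xsr.getNamespaceCount(); i++) {
            String prefix = xsr.getNamespacePrefix(i);
            String uri = xsr.getNamespaceURI(i);
            e.setAttributeNS(XMLConstants.XMLNS_ATTRIBUTE_NS_URI,
                    prefix == null || prefix.isEmpty() ? "xmlns" : "xmlns:" + prefix,
                    uri == null ? "" : uri);
        }
        for (int i = 0; i < xsr.getAttributeCount(); i++) {
            e.setAttributeNS(emptyToNull(xsr.getAttributeNamespace(i)),
                    qName(xsr.getAttributePrefix(i), xsr.getAttributeLocalName(i)),
                    xsr.getAttributeValue(i));
        }
        return e;
    }

    private static String qName(String prefix, String localName) {
        return prefix == null || prefix.isEmpty() ? localName : prefix + ":" + localName;
    }

    private static String emptyToNull(String s) {
        return s == null || s.isEmpty() ? null : s;
    }
}
```

The caller advances the reader with next() until it sees the start tag of an entry, calls copyElement, and passes the returned Element on. Because the cursor is left on the entry's end tag, the caller's own next() call decides what happens afterwards, which avoids any doubt about where the cursor ends up.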

Peggie answered 19/12, 2018 at 11:33 Comment(0)
S
0

You can try XMLDog from JLibs.

It evaluates XPath expressions on an XML document using SAX (i.e. without loading the entire XML into memory) and returns DOM nodes for the matching nodes as they are hit.

So you can evaluate the XPath /Items/Item on your large XML document. You will be notified as each Item node is parsed; you can then process the current Item DOM node and continue.

This makes it suitable for evaluating XPath expressions on large documents.

Splice answered 21/2, 2012 at 16:12 Comment(0)
