stax - get xml node as string
G

6

6

xml looks like so:

<statements>
   <statement account="123">
      ...stuff...
   </statement>
   <statement account="456">
      ...stuff...
   </statement>
</statements>

I'm using StAX to process one "<statement>" at a time, and I got that working. I need to get the entire statement node as a string so I can create "123.xml" and "456.xml", or maybe even load it into a database table indexed by account.

I'm using this approach: http://www.devx.com/Java/Article/30298/1954

I'm looking to do something like this:

String statementXml = staxXmlReader.getNodeByName("statement");

//load statementXml into database
Galliwasp answered 4/12, 2010 at 3:52 Comment(1)
What is your question exactly? - Huddle
H
0

Why not just use xpath for this?

You could have a fairly simple xpath to get all 'statement' nodes.

Like so:

//statement

EDIT #1: If possible, take a look at dom4j. You could read the String and get all 'statement' nodes fairly simply.

EDIT #2: Using dom4j, this is how you would do it: (from their cookbook)

String text = "your xml here";
Document document = DocumentHelper.parseText(text);

public void bar(Document document) {
   List list = document.selectNodes( "//statement" );
   // loop through node data
}
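
For comparison, here is a minimal sketch of the same XPath idea using only the XPath and DOM classes that ship with the JDK, with each matched node serialized back to a string. The class name XPathStatementSplitter and the method split are made up for illustration, and the sketch assumes the document is small enough to load into memory in one piece, which is exactly what a streaming approach avoids.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class XPathStatementSplitter {

    /** Returns each statement element serialized as a string, keyed by its account attribute. */
    static Map<String, String> split(String xml) {
        try {
            // Loads the WHOLE document into memory; only suitable for small inputs.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//statement", doc, XPathConstants.NODESET);

            // Identity transformer, used only to turn a DOM node back into text.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            Map<String, String> byAccount = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element statement = (Element) nodes.item(i);
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(statement), new StreamResult(out));
                byAccount.put(statement.getAttribute("account"), out.toString());
            }
            return byAccount;
        } catch (Exception e) {
            // Sketch only: collapse all parser/XPath/transform failures into one unchecked exception.
            throw new IllegalStateException("Could not split statements", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<statements>"
                + "<statement account=\"123\"><line>a</line></statement>"
                + "<statement account=\"456\"><line>b</line></statement>"
                + "</statements>";
        split(xml).forEach((account, fragment) ->
                System.out.println(account + ".xml -> " + fragment));
    }
}
```

Each map value could then be written to "<account>.xml" or stored in a database column.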
Huddle answered 4/12, 2010 at 5:0 Comment(3)
There are also standard XPath libraries in the JDK/JRE: #3940136 - Mise
The poster explicitly mentioned StAX, so I don't think pointers to dom4j or other libraries helped him much... - Coprology
Given that the OP never asked a question, the suggestion to use XPath is as good as anything. Maybe better. - Piscina
C
11

I had a similar task, and although the original question is more than a year old, I couldn't find a satisfying answer. The most interesting answer so far was Blaise Doughan's, but I couldn't get it running on the XML I am expecting (maybe some parameters for the underlying parser could change that?). Here is the XML, very simplified:

<many-many-tags>
    <description>
        ...
        <p>Lorem ipsum...</p>
        Devils inside...
        ...
    </description>
</many-many-tags>

My solution:

public static String readElementBody(XMLEventReader eventReader)
    throws XMLStreamException {
    StringWriter buf = new StringWriter(1024);

    int depth = 0;
    while (eventReader.hasNext()) {
        // peek event
        XMLEvent xmlEvent = eventReader.peek();

        if (xmlEvent.isStartElement()) {
            ++depth;
        }
        else if (xmlEvent.isEndElement()) {
            --depth;

            // reached END_ELEMENT tag?
            // break loop, leave event in stream
            if (depth < 0)
                break;
        }

        // consume event
        xmlEvent = eventReader.nextEvent();

        // print out event
        xmlEvent.writeAsEncodedUnicode(buf);
    }

    return buf.getBuffer().toString();
}

Usage example:

XMLEventReader eventReader = ...;
while (eventReader.hasNext()) {
    XMLEvent xmlEvent = eventReader.nextEvent();
    if (xmlEvent.isStartElement()) {
        StartElement elem = xmlEvent.asStartElement();
        String name = elem.getName().getLocalPart();

        if ("description".equals(name)) {
            String xmlFragment = readElementBody(eventReader);
            // do something with it...
            System.out.println("'" + xmlFragment + "'");
        }
    }
    else if (xmlEvent.isEndElement()) {
        // ...
    }
}

Note that the extracted XML fragment will contain the complete body content, including white space and comments. Filtering those on demand and making the buffer size configurable have been left out for brevity:

'
    <description>
        ...
        <p>Lorem ipsum...</p>
        Devils inside...
        ...
    </description>
    '
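
Applied to the XML from the question, the same depth-counting idea could look roughly like the sketch below. It differs from readElementBody in that it also writes the element's own start and end tags, so each result is a complete statement element. The names StatementFragments, split and readElement are hypothetical. It assumes the input uses no namespaces, because another answer in this thread reports that writeAsEncodedUnicode produces invalid start tags for namespaced elements on Java 11.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of any library.
final class StatementFragments {

    /** Collects every statement element, tags included, keyed by its account attribute. */
    static Map<String, String> split(String xml) {
        Map<String, String> byAccount = new LinkedHashMap<>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && "statement".equals(event.asStartElement().getName().getLocalPart())) {
                    StartElement start = event.asStartElement();
                    Attribute account = start.getAttributeByName(new QName("account"));
                    // Statements without an account attribute all share the empty key in this sketch.
                    String key = account == null ? "" : account.getValue();
                    byAccount.put(key, readElement(start, reader));
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not split statements", e);
        }
        return byAccount;
    }

    /** Writes the given start tag plus everything up to and including its matching end tag. */
    static String readElement(StartElement start, XMLEventReader reader)
            throws XMLStreamException {
        StringWriter buf = new StringWriter();
        start.writeAsEncodedUnicode(buf);
        int depth = 1; // we are inside the element whose start tag was just written
        while (depth > 0 && reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                depth--;
            }
            // The matching end tag (depth == 0) is written too, then the loop stops.
            event.writeAsEncodedUnicode(buf);
        }
        return buf.toString();
    }
}
```

Because the matching end tag is consumed here, the caller's loop simply continues with the next event after the statement.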
Coprology answered 25/7, 2012 at 13:44 Comment(2)
Is there a way to print the string without the namespace? - Trilobite
I'm not sure I understand your question, what namespace? Can you give an example? - Coprology
M
6

You can use StAX for this. You just need to advance the XMLStreamReader to the start element for statement, check the account attribute to get the file name, and then use the javax.xml.transform APIs to transform a StAXSource into a StreamResult wrapping a File. The transform advances the XMLStreamReader, so you can then just repeat this process.

import java.io.File;
import java.io.FileReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
        xsr.nextTag(); // Advance to statements element

        while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer t = tf.newTransformer();
            File file = new File("out" + xsr.getAttributeValue(null, "account") + ".xml");
            t.transform(new StAXSource(xsr), new StreamResult(file));
        }
    }

}
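
If the goal is a String rather than a file, the same StAXSource idea can target a StringWriter instead. The sketch below is one possible variant, not a definitive fix: it avoids nextTag() and re-inspects the current event after every transform, because where the reader is left after a transform may depend on the transformer implementation. The names StaxSourceSplitter and split are hypothetical, and the behaviour was reasoned through for the JDK's built-in transformer only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

// Hypothetical helper, not part of any library.
final class StaxSourceSplitter {

    /** Returns each statement element as a string, keyed by its account attribute. */
    static Map<String, String> split(String xml) {
        Map<String, String> byAccount = new LinkedHashMap<>();
        try {
            XMLStreamReader xsr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // One identity transformer is reused for all statements.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            while (xsr.hasNext()) {
                boolean atStatement = xsr.getEventType() == XMLStreamConstants.START_ELEMENT
                        && "statement".equals(xsr.getLocalName());
                if (atStatement) {
                    String account = xsr.getAttributeValue(null, "account");
                    StringWriter out = new StringWriter();
                    // The transform consumes the element. Do not call next() afterwards;
                    // the loop re-inspects whatever event the reader was left on.
                    identity.transform(new StAXSource(xsr), new StreamResult(out));
                    byAccount.put(account, out.toString());
                } else {
                    xsr.next();
                }
            }
        } catch (XMLStreamException | TransformerException e) {
            throw new IllegalStateException("Could not split statements", e);
        }
        return byAccount;
    }

    public static void main(String[] args) {
        String xml = "<statements>"
                + "<statement account=\"123\"><line>a</line></statement>"
                + "<statement account=\"456\"><line>b</line></statement>"
                + "</statements>";
        split(xml).forEach((account, fragment) ->
                System.out.println(account + ".xml -> " + fragment));
    }
}
```

Checking the event type instead of calling nextTag() means text between elements does not raise an exception, and input without white space between consecutive statements is not skipped.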
Mise answered 4/12, 2010 at 12:43 Comment(3)
Using while (xsr.nextTag...) will fail. The StAX documentation for xsr.nextTag() states that an exception will be thrown if xsr.hasNext() is false and nextTag() is called. Also, when using xsr.nextTag(), if anything other than white space characters, COMMENT, PROCESSING_INSTRUCTION, START_ELEMENT or END_ELEMENT is encountered, an exception is thrown. - Eject
When I use the above code, I get the following error: Exception in thread "main" net.sf.saxon.trans.XPathException: org.w3c.dom.DOMException: HIERARCHY_REQUEST_ERR: An attempt was made to insert a node where it is not permitted. Any idea? - Gory
Conceptually, calling xsr.nextTag() is wrong, since the XMLStreamReader may already start at the right tag if "input.xml" does not contain headers. Trying all possible cases, I always receive the error: java.lang.IllegalStateException: Attempt to output end tag with no matching start tag. @Coprology's solution is the only valid one for me. - Paeon
V
4

StAX is a low-level access API, and it has neither lookups nor methods that access content recursively. But what are you actually trying to do? And why are you considering StAX?

Beyond using a tree model (DOM, XOM, JDOM, dom4j), which would work well with XPath, the best choice when dealing with data is usually a data binding library like JAXB. You can pass it a StAX or SAX reader and ask it to bind the XML data into Java beans, so that you work with Java objects instead of messing with XML. This is often more convenient, and it usually performs quite well. The only trick with larger files is that you do not want to bind the whole thing at once, but rather bind each sub-tree (in your case, one 'statement' at a time). This is most easily done by iterating with a StAX XMLStreamReader and then using JAXB to bind each sub-tree.

Vendace answered 4/12, 2010 at 5:24 Comment(0)
G
1

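I've been googling and this seems painfully difficult.

Given my XML, I think it might just be simpler to:

// STMT_END_TAG would be something like "</statement>"; this only works if
// each closing tag sits on its own line. "input.xml" is a placeholder name.
StringBuilder buffer = new StringBuilder();
try (BufferedReader reader = new BufferedReader(new FileReader("input.xml"))) {
   String line;
   while ((line = reader.readLine()) != null) {
      buffer.append(line).append('\n');
      if (line.trim().equals(STMT_END_TAG)) {
         parse(buffer.toString());
         buffer.setLength(0); // start collecting the next statement
      }
   }
}

private void parse(String statement) {
   // saxParser.parse(new InputSource(new StringReader(statement)), handler);
   // do stuff
   // save string
}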
Galliwasp answered 4/12, 2010 at 4:12 Comment(0)
M
0

I had a similar problem and found a solution. I used the approach proposed by @t0r0X, but it does not work well with the current implementation in Java 11: the method xmlEvent.writeAsEncodedUnicode creates an invalid string representation of the start element (in the StartElementEvent class), which ends up in the resulting XML fragment. I had to replace that part, and after that it seems to work well. I could verify this immediately by parsing the fragment with DOM and passing it through the JAXB unmarshaller into specific data containers.

In my case I had the huge structure

<Orders>
   <ns2:SyncOrder xmlns:ns2="..." xmlns:ns3="....." ....>
      .....
   </ns2:SyncOrder>
   <ns2:SyncOrder xmlns:ns2="..." xmlns:ns3="....." ....>
      .....
   </ns2:SyncOrder>
   ...
</Orders>

in a file of several hundred megabytes (many repeating "SyncOrder" structures), so using DOM on the whole file would lead to large memory consumption and slow evaluation. Therefore I used StAX to split the huge XML into smaller pieces, each of which I analyzed with DOM and unmarshalled using the JAXB classes generated from the XSD definition of the SyncOrder element (I already had this infrastructure from a web service that uses the same structure, but that is not important here).

The code below shows where the XML fragment is created and could be used; I passed it directly on to further processing...

private static <T> List<T> unmarshallMultipleSyncOrderXmlData(
        InputStream aOrdersXmlContainingSyncOrderItems,
        Function<SyncOrderType, T> aConversionFunction) throws XMLStreamException, ParserConfigurationException, IOException, SAXException {

    DocumentBuilderFactory locDocumentBuilderFactory = DocumentBuilderFactory.newInstance();
    locDocumentBuilderFactory.setNamespaceAware(true);
    DocumentBuilder locDocBuilder = locDocumentBuilderFactory.newDocumentBuilder();

    List<T> locResult = new ArrayList<>();
    XMLInputFactory locFactory = XMLInputFactory.newFactory();
    XMLEventReader locReader = locFactory.createXMLEventReader(aOrdersXmlContainingSyncOrderItems);

    boolean locIsInSyncOrder = false;
    QName locSyncOrderElementQName = null;
    StringWriter locXmlTextBuffer = new StringWriter();
    int locDepth = 0;
    while (locReader.hasNext()) {

        XMLEvent locEvent = locReader.nextEvent();

        if (locEvent.isStartElement()) {
            if (locDepth == 0 && Objects.equals(locEvent.asStartElement().getName().getLocalPart(), "Orders")) {
                locDepth++;
            } else {
                if (locDepth <= 0)
                    throw new IllegalStateException("An invalid XML stream was passed into the function. "
                            + "Expected the element 'Orders' as the root element of the document, but found '"
                            + locEvent.asStartElement().getName().getLocalPart() + "'.");
                locDepth++;
                if (locSyncOrderElementQName == null) {
                    /* First element after the "Orders" has passed, so we retrieve
                     * the name of the element with the namespace prefix: */
                    locSyncOrderElementQName = locEvent.asStartElement().getName();
                }
                if(Objects.equals(locEvent.asStartElement().getName(), locSyncOrderElementQName)) {
                    locIsInSyncOrder = true;
                }
            }
        } else if (locEvent.isEndElement()) {
            locDepth--;
            if(locDepth == 1 && Objects.equals(locEvent.asEndElement().getName(), locSyncOrderElementQName)) {
                locEvent.writeAsEncodedUnicode(locXmlTextBuffer);
                /* At this point locXmlTextBuffer.toString() returns the complete XML fragment
                 * containing one valid SyncOrder element. Here it is passed straight on to further
                 * processing, which immediately checks that the produced fragment is valid and
                 * passes the values to the communication object: */
                Document locDocument = locDocBuilder.parse(new ByteArrayInputStream(locXmlTextBuffer.toString().getBytes(StandardCharsets.UTF_8)));
                SyncOrderType locItem = unmarshallSyncOrderDomNodeToCo(locDocument);
                locResult.add(aConversionFunction.apply(locItem));
                locXmlTextBuffer = new StringWriter();
                locIsInSyncOrder = false;
            }
        }
        if (locIsInSyncOrder) {
            if (locEvent.isStartElement()) {
                /* replaces the standard implementation of StartElement's writeAsEncodedUnicode: */
                locXmlTextBuffer.write(startElementToString(locEvent.asStartElement()));
            } else {
                locEvent.writeAsEncodedUnicode(locXmlTextBuffer);
            }
        }
    }
    return locResult;
}

private static String startElementToString(StartElement aStartElement) {

    StringBuilder locStartElementBuffer = new StringBuilder();

    // open element
    locStartElementBuffer.append("<");
    String locNameAsString = null;
    if ("".equals(aStartElement.getName().getNamespaceURI())) {
        locNameAsString = aStartElement.getName().getLocalPart();
    } else if (aStartElement.getName().getPrefix() != null
            && !"".equals(aStartElement.getName().getPrefix())) {
        locNameAsString = aStartElement.getName().getPrefix()
                + ":" + aStartElement.getName().getLocalPart();
    } else {
        locNameAsString = aStartElement.getName().getLocalPart();
    }

    locStartElementBuffer.append(locNameAsString);

    // add any attributes
    Iterator<Attribute> locAttributeIterator = aStartElement.getAttributes();
    Attribute attr;
    while (locAttributeIterator.hasNext()) {
        attr = locAttributeIterator.next();
        locStartElementBuffer.append(" ");
        locStartElementBuffer.append(attributeToString(attr));
    }

    // add any namespaces
    Iterator<Namespace> locNamespaceIterator = aStartElement.getNamespaces();
    Namespace locNamespace;
    while (locNamespaceIterator.hasNext()) {
        locNamespace = locNamespaceIterator.next();
        locStartElementBuffer.append(" ");
        locStartElementBuffer.append(attributeToString(locNamespace));
    }

    // close start tag
    locStartElementBuffer.append(">");

    // return StartElement as a String
    return locStartElementBuffer.toString();
}

private static String attributeToString(Attribute aAttr) {
    if( aAttr.getName().getPrefix() != null && aAttr.getName().getPrefix().length() > 0 )
        return aAttr.getName().getPrefix() + ":" + aAttr.getName().getLocalPart() + "='" + aAttr.getValue() + "'";
    else
        return aAttr.getName().getLocalPart() + "='" + aAttr.getValue() + "'";
}

public static SyncOrderType unmarshallSyncOrderDomNodeToCo(
        Node aSyncOrderItemNode) {
    Source locSource = new DOMSource(aSyncOrderItemNode);
    Object locUnmarshalledObject = getMarshallerAndUnmarshaller().unmarshal(locSource);
    SyncOrderType locCo = ((JAXBElement<SyncOrderType>) locUnmarshalledObject).getValue();
    return locCo;
}
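
As a possible alternative to building the start tag by hand, the events can be passed to an XMLEventWriter from XMLOutputFactory, which serializes prefixes, namespace declarations and attributes itself. The sketch below is written under a few assumptions: the names EventWriterSplitter and splitChildren are hypothetical, every direct child of the root element is treated as one fragment, and the namespaces are declared on each child element itself, as in the SyncOrder sample above. If a namespace were declared only on an ancestor, the fragment would lack that declaration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of any library.
final class EventWriterSplitter {

    /** Returns every direct child of the root element as a standalone XML string. */
    static List<String> splitChildren(String xml) {
        List<String> fragments = new ArrayList<>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();

            int depth = 0;
            StringWriter buffer = null;
            XMLEventWriter writer = null;
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    depth++;
                    if (depth == 2) {
                        // A child of the root starts: open a fresh buffer and writer.
                        buffer = new StringWriter();
                        writer = outputFactory.createXMLEventWriter(buffer);
                    }
                }
                if (writer != null) {
                    // The writer serializes the event, including namespace declarations.
                    writer.add(event);
                }
                if (event.isEndElement()) {
                    if (depth == 2) {
                        // The child of the root is complete: collect it.
                        writer.flush();
                        writer.close();
                        fragments.add(buffer.toString());
                        writer = null;
                    }
                    depth--;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not split document", e);
        }
        return fragments;
    }

    public static void main(String[] args) {
        String xml = "<Orders>"
                + "<ns2:SyncOrder xmlns:ns2=\"urn:example:orders\" id=\"1\">"
                + "<ns2:Item>a &amp; b</ns2:Item></ns2:SyncOrder>"
                + "</Orders>";
        splitChildren(xml).forEach(System.out::println);
    }
}
```

Each returned string can then be parsed with DOM or unmarshalled in the same way as locXmlTextBuffer.toString() in the code above.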
Mathewson answered 24/12, 2021 at 12:6 Comment(0)
