Parsing an XML stream with no root element
N

6

17

I need to parse a continuous stream of well-formed XML elements, for which I am given only an already-constructed java.io.Reader object. These elements are not enclosed in a root element, nor are they preceded by an XML header such as <?xml version="1.0"?>, but they are otherwise valid XML.

Using the Java org.xml.sax.XMLReader class does not work, because the XMLReader expects to parse well-formed XML, starting with an enclosing root element. So it reads the first element in the stream, which it perceives as the root, and fails on the next one with the typical

org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
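
A minimal sketch that reproduces this behaviour, assuming the JDK's default SAX parser. The names RootlessDemo and parseError are made up for illustration and are not part of any library:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical demo class, for illustration only
class RootlessDemo {
    /** Returns the parser's error message, or null if the input parsed cleanly. */
    static String parseError(String xml) throws Exception {
        XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        try {
            xmlReader.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // Thrown as soon as the parser meets a second top-level element
            return e.getMessage();
        }
    }
}
```

Calling RootlessDemo.parseError("<a/>") should return null, while RootlessDemo.parseError("<a/><b/>") should return an error message, since the parser treats <a/> as the root and rejects <b/>.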

For files that do not contain a root element, but where such an element does exist or can be defined (and is called, say, MyRootElement), one can do something like the following:

        String path = <the full path to the file>;

        XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        StringBuilder buffer = new StringBuilder();

        buffer.append("<?xml version=\"1.0\"?>\n");
        buffer.append("<!DOCTYPE MyRootElement ");
        buffer.append("[<!ENTITY data SYSTEM \"file:///");
        buffer.append(path);
        buffer.append("\">]>\n");
        buffer.append("<MyRootElement xmlns:...>\n");
        buffer.append("&data;\n");
        buffer.append("</MyRootElement>\n");

        InputSource source = new InputSource(new StringReader(buffer.toString()));

        xmlReader.parse(source);

I have tested the above by saving part of the java.io.Reader output to a file and it works. However, this approach is not applicable in my case and such extra information (XML header, root element) cannot be inserted, since the java.io.Reader object passed to my code is already constructed.

Essentially, I am looking for "fragmented XML parsing". So, my question is: can it be done using standard Java APIs (including the org.xml.sax.* and javax.xml.* packages)?

Nautilus answered 10/7, 2011 at 11:20 Comment(1)
V
15

SequenceInputStream comes to the rescue:

    SAXParserFactory saxFactory = SAXParserFactory.newInstance();
    SAXParser parser = saxFactory.newSAXParser();

    parser.parse(
        new SequenceInputStream(
            Collections.enumeration(Arrays.asList(
            new InputStream[] {
                new ByteArrayInputStream("<dummy>".getBytes()),
                new FileInputStream(file),//bogus xml
                new ByteArrayInputStream("</dummy>".getBytes()),
            }))
        ), 
        new DefaultHandler()
    );
Verina answered 24/3, 2012 at 10:47 Comment(0)
W
9

You can wrap your given Reader in a FilterReader subclass that you implement to do more or less what you're doing here.

Edit:

This is similar to the proposal, made in a couple of other answers, to implement your own Reader that delegates to the given Reader object. However, just about all methods in FilterReader would have to be overridden, so you may not gain much from using the superclass.

An interesting variation on the other proposals might be to implement a SequencedReader which wraps multiple Reader objects and shifts to the next in the sequence when one is used up. Then you could pass in a StringReader object with the start text for the root you want to add, the original Reader and another StringReader with the closing tag.
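
A minimal sketch of that idea, under the assumption that a plain Reader subclass is acceptable. SequenceReader and SequenceReaderDemo are hypothetical names (the JDK has SequenceInputStream for byte streams, but no Reader counterpart), and the "dummy" root name is arbitrary:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: presents several Readers, in order, as one Reader. */
class SequenceReader extends Reader {
    private final Deque<Reader> readers;

    SequenceReader(Reader... readers) {
        this.readers = new ArrayDeque<>(Arrays.asList(readers));
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (!readers.isEmpty()) {
            int n = readers.peek().read(cbuf, off, len);
            if (n != -1) {
                return n;
            }
            // Current reader is used up: close it and shift to the next one
            readers.poll().close();
        }
        return -1; // every reader in the sequence is exhausted
    }

    @Override
    public void close() throws IOException {
        while (!readers.isEmpty()) {
            readers.poll().close();
        }
    }
}

/** Hypothetical usage: wrap a rootless stream and count its top-level-or-nested elements. */
class SequenceReaderDemo {
    static int countFragments(Reader fragments) throws Exception {
        Reader wrapped = new SequenceReader(
                new StringReader("<dummy>"), fragments, new StringReader("</dummy>"));
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(wrapped),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        count[0]++;
                    }
                });
        return count[0] - 1; // subtract the synthetic <dummy> root
    }
}
```

Because the original Reader is only wrapped, never re-created, this fits the constraint that the Reader passed in is already constructed.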

Wadsworth answered 10/7, 2011 at 11:38 Comment(0)
M
5

You can write your own Reader implementation that encapsulates the Reader instance you're given. This new Reader should do just what you're doing in your example code: provide the header and root element, then the data from the underlying reader, and finally the closing root tag. This way you provide a valid XML stream to the XML parser while still using the Reader object passed to your code.

Modicum answered 10/7, 2011 at 11:41 Comment(2)
+1 Great minds think alike (although mine thought it 1 minute before yours :) ) - Calenture
+1 to both of you. Directly implementing a Reader may be better than trying to subclass FilterReader as in my response. - Wadsworth
C
4

You can create your own Reader that delegates to the provided Reader, like this:

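final Reader reader = <whatever you are getting>;

Reader wrappedReader = new Reader()
{
    final String start = "<?xml version=\"1.0\"?><MyRootElement>";
    final String end = "</MyRootElement>";
    int startIndex;     // number of chars of start already returned
    int endIndex;       // number of chars of end already returned
    boolean readerDone; // true once the wrapped reader has reported end of stream

    @Override
    public void close() throws IOException
    {
        reader.close();
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException
    {
        if (len == 0)
        {
            return 0;
        }
        // Phase 1: serve the XML header and the opening root tag
        if (startIndex < start.length())
        {
            int n = Math.min(len, start.length() - startIndex);
            start.getChars(startIndex, startIndex + n, cbuf, off);
            startIndex += n;
            return n;
        }
        // Phase 2: delegate to the wrapped reader until it is exhausted
        if (!readerDone)
        {
            int result = reader.read(cbuf, off, len);
            if (result != -1)
            {
                return result;
            }
            readerDone = true;
        }
        // Phase 3: serve the closing root tag, then signal end of stream
        if (endIndex < end.length())
        {
            int n = Math.min(len, end.length() - endIndex);
            end.getChars(endIndex, endIndex + n, cbuf, off);
            endIndex += n;
            return n;
        }
        return -1;
    }
};

The wrapper first reads from start, then delegates to the reader in the middle, and finally, once the reader is empty, reads from end.

This approach will work.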

Calenture answered 10/7, 2011 at 11:39 Comment(1)
But isn't there really any XML parsing class that can read "fragmented" XML? - Nautilus
H
3

Just insert a dummy root element. The most elegant solution I can think of is to create your own InputStream or Reader that wraps the regular InputStream/Reader and returns the dummy <dummyroot> the first time you call its read() / readLine(), and then returns the contents of the payload stream. This should satisfy the SAX parser.
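
Whichever wrapper supplies the dummy root (a matching closing tag is needed at the end of the stream as well), the handler will see one extra element. A minimal sketch of a handler that skips it by tracking nesting depth; FragmentHandler and its parse helper are hypothetical names, and the input is pre-wrapped in a string here only to keep the example self-contained:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records the names of the real top-level fragments. */
class FragmentHandler extends DefaultHandler {
    final List<String> fragments = new ArrayList<>();
    private int depth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Depth 0 is the synthetic root; depth 1 is a fragment from the payload
        if (depth == 1) {
            fragments.add(qName);
        }
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    /** Parses XML that has already been wrapped in a dummy root element. */
    static List<String> parse(String wrappedXml) throws Exception {
        FragmentHandler handler = new FragmentHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(wrappedXml)), handler);
        return handler.fragments;
    }
}
```

For example, FragmentHandler.parse("<dummyroot><a><c/></a><b/></dummyroot>") collects a and b, ignoring both the dummy root and the nested c.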

Harpy answered 10/7, 2011 at 11:37 Comment(0)
A
2

This answer works for me, but I had to take the extra step of creating an InputSource from the SequenceInputStream.

XMLReader xmlReader = saxParser.getXMLReader();
xmlReader.setContentHandler((ContentHandler) this);
// Trying to add root element
Enumeration<InputStream> streams = Collections.enumeration(
    Arrays.asList(new InputStream[] {
        new ByteArrayInputStream("<TopNode>".getBytes()),
        new FileInputStream(xmlFile),//bogus xml
        new ByteArrayInputStream("</TopNode>".getBytes()),
}));
SequenceInputStream seqStream = new SequenceInputStream(streams);
InputSource is = new InputSource(seqStream);
xmlReader.parse(is);
Ardenia answered 10/4, 2013 at 18:10 Comment(3)
Usually answers are reordered, so "Answer 3" is relative. Which answer do you mean? - Safety
I meant the answer given by user656449 - Ardenia
Doesn't compile - notice 'seqStream' isn't defined anywhere, and renaming to stream generates a 'no suitable constructor' error for new InputSource. - Require
