Is there a Push-based/Non-blocking XML Parser for Java?
S

7

17

I'm looking for an XML parser that instead of parsing from an InputStream or InputSource will instead allow blocks of text to be pushed into the parser. E.g. I would like to have something like the following:

public class DataReceiver {
    private SAXParser parser = //...
    private DefaultHandler handler = //...

    /**
     * Called each time some data is received.
     */
    public void onDataReceived(byte[] data) {
        parser.push(data, handler);
    }
}

The reason is that I would like something that will play nice with the NIO networking libraries rather than having to revert back to a thread per connection model required to support a blocking InputStream.

Surfperch answered 21/6, 2009 at 12:5 Comment(4)
It would be interesting to know how long your XML docs are.Robinette
I don't have any XML docs, I am looking at implementing an XMPP server, hence I'm looking for something that works well with the NIO networking libraries.Surfperch
It might be a good idea to have a look at open-source XMPP servers written in Java to see how they solve the problem. Tigase and OpenFire are the first candidates that came to mind.Magenmagena
Sigh, had the same question. This is why all runtimes should have call-cc. Then we could get this just by implementing InputStream.Towel
K
3

This is a (April 2009) post from the Xerces J-Users mailing list, where the original poster is having the exact same issue. One potentially very good response by "Jeff" is given, but there is no follow up to the original poster's response:

http://www.nabble.com/parsing-an-xml-document-chunk-by-chunk-td22945319.html

It's potentially new enough to bump on the list, or at very least help with the search.

Edit

Found another useful link, mentioning a library called Woodstox and describing the state of Stream vs. NIO based parsers and some possible approaches to emulating a stream:

http://markmail.org/message/ogqqcj7dt3lwkbov

Knp answered 25/6, 2009 at 20:37 Comment(2)
Good findings. Gave me an idea. As Xerces is open source, get it (or in fact any open source XML parser that is small enough), hack the place where it reads the bytes from the input stream, and create a clever way to save and restore state to allow an early return / continue. But I have no clue how to implement it.Except
possibly good call on that hacking. also, I emailed the original poster and sent him this thread. I hope he's figured something out and will share. crosses fingersKnp
W
7

Surprisingly, no one has mentioned the one Java XML parser that does implement non-blocking ("async") parsing: Aalto. Part of the reason may be its lack of documentation (and its low level of activity). Aalto implements the basic StAX API, plus minor extensions that allow pushing input (the functionality exists, but this part of the API has not been finalized). For more information, you could check out the related discussion group.

Wilmott answered 31/8, 2010 at 5:44 Comment(7)
It didn't come up in a number of Google searches, I'll check it out.Surfperch
Yeah, it's not easy to find with generic searches, given its low profile. Hopefully that will change soon since there are some more developers interested in it.Wilmott
Aalto was recently relicensed as Apache License too: github.com/FasterXML/aalto-xmlLycian
Plus a blog entry explaining how to use non-blocking API extension: cowtowncoder.com/blog/archives/2011/03/entry_451.htmlWilmott
This should have been the selected answer. We're using it in production and are quite happy with it.Captive
Good to know -- I don't get much feedback on usage, so like hearing Aalto is being used (there is a Yahoo discussion group at tech.groups.yahoo.com/group/aalto-xml-interest)Wilmott
Aalto works great, I've published some helper classes: github.com/skjolber/async-stax-utilsThermolabile
E
4

Edit: Now I see. You receive the XML in chunks and you want to feed it into a proper XML parser. So you need an object that is a queue at one end and an InputStream at the other?

You could aggregate the byte arrays received into a ByteArrayOutputStream, convert it to ByteArrayInputStream and feed it to the SAXParser.

Or you could check out the PipedInputStream/PipedOutputStream pair. In this case, you'll need to do the parsing in another thread, as the SAX parser uses the current thread to emit events and would otherwise block your receive().
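A rough sketch of the piped approach, using only JDK classes. The class name PipedSaxBridge and its methods are made up for illustration, and error handling is minimal. Note that it still dedicates one parser thread to each stream.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: pushes received chunks into a SAX parser running on its own thread. */
public class PipedSaxBridge {
    private final PipedOutputStream sink = new PipedOutputStream();
    private final List<String> elements = Collections.synchronizedList(new ArrayList<String>());
    private final Thread parserThread;

    public PipedSaxBridge() throws IOException {
        final PipedInputStream source = new PipedInputStream(sink);
        // For the sketch, the handler only records element names as they start.
        final DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                elements.add(qName);
            }
        };
        // The parser blocks on the pipe, so it must not run on the receiving thread.
        parserThread = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
            } catch (Exception e) {
                e.printStackTrace(); // sketch only: real code should report this to the caller
            }
        });
        parserThread.start();
    }

    /** Called each time some data is received. */
    public void onDataReceived(byte[] data) throws IOException {
        sink.write(data);
        sink.flush(); // wakes the reader instead of waiting for its poll interval
    }

    /** Signals end of input, waits for the parser, and returns the element names seen. */
    public List<String> finish() throws IOException, InterruptedException {
        sink.close();
        parserThread.join();
        return elements;
    }
}
```

Usage would be: create one bridge per connection, call onDataReceived(...) for every chunk, and call finish() when the connection closes.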

Edit: Based on the comments I suggest taking the aggregation route. You collect the chunks into a ByteArrayOutputStream. To know whether you have received all chunks of your XML, check whether the current chunk or the contents of the ByteArrayOutputStream contain the end tag of the XML root node. Then you can just pass the data to a SAXParser, which can now run in the current thread without problems. To avoid unnecessary array re-creation you could implement your own simple unsynchronized byte array wrapper or look for such an implementation.
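A minimal sketch of the aggregation route, under a few assumptions: the root element name is known in advance, the data is UTF-8, and no nested element shares the root's name. ChunkAggregator is a made-up name. Re-decoding the whole buffer on every chunk is wasteful and is done here only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: buffers chunks until the root end tag arrives, then parses. */
public class ChunkAggregator {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final String endTag;
    private final DefaultHandler handler;

    public ChunkAggregator(String rootName, DefaultHandler handler) {
        this.endTag = "</" + rootName + ">";
        this.handler = handler;
    }

    /**
     * Called each time some data is received.
     * Returns true if this chunk completed a document and it was parsed.
     */
    public boolean onDataReceived(byte[] data) throws Exception {
        buffer.write(data, 0, data.length);
        // Check the whole buffer, not just the chunk: the end tag may be split across chunks.
        String soFar = buffer.toString("UTF-8");
        if (!soFar.trim().endsWith(endTag)) {
            return false; // document still incomplete, keep collecting
        }
        // The parser sees a complete document, so it never blocks waiting for input.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(buffer.toByteArray()), handler);
        buffer.reset();
        return true;
    }
}
```

The receiving thread does the parsing itself, so no extra thread is needed, at the cost of holding each whole document in memory before any event is delivered.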

Except answered 21/6, 2009 at 20:9 Comment(3)
The PipedInputStream/PipedOutputStream looks like a good way to go to provide traditional parsers with the InputStream and push in chunks.Wetterhorn
This approach still requires a thread for each client connection, it just shifts the problem to a different area of the code. If I am to go with thread per connection model, then I would probably just use traditional Sockets and InputStreams as the solution would be simpler.Surfperch
Beats me. I have never heard of a Java parser with the feature you are looking for. I guess it's very hard to implement one without an equivalent of the C# yield return construct, i.e. stopping the parse, returning to the caller and continuing later requires managing a complex state machine.Except
E
1

Check Openfire's XMLLightweightParser and how it generates XML messages from single chunks because of NIO. The whole project is a great source for answers regarding NIO and XMPP questions.
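To illustrate the general framing idea (this is not Openfire's code), here is a simplified sketch that buffers pushed text and hands back each top-level element once its closing tag has arrived. ElementFramer is a hypothetical name. The sketch assumes already-decoded characters and does not handle comments or CDATA sections. An XMPP-style framer would emit elements at depth 1 rather than depth 0, since the stream's root element stays open.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical framer: returns complete top-level elements from pushed text chunks. */
public class ElementFramer {
    private final StringBuilder pending = new StringBuilder();
    private int scanned = 0;      // how much of 'pending' has been examined so far
    private int depth = 0;        // number of currently open elements
    private boolean inTag, atTagStart, closing, neutral;
    private char quote = 0;       // active attribute quote character, or 0 if none
    private char prev = 0;        // previously examined character

    /** Appends a chunk and returns any elements it completed (possibly none). */
    public List<String> feed(String chunk) {
        List<String> complete = new ArrayList<String>();
        pending.append(chunk);
        while (scanned < pending.length()) {
            char c = pending.charAt(scanned++);
            if (!inTag) {
                if (c == '<') { inTag = true; atTagStart = true; closing = false; neutral = false; }
            } else {
                if (quote != 0) {
                    if (c == quote) quote = 0;            // end of attribute value
                } else if (atTagStart) {
                    closing = (c == '/');                 // </name>
                    neutral = (c == '?' || c == '!');     // declaration: depth unchanged
                } else if (c == '"' || c == '\'') {
                    quote = c;                            // '>' inside a value is not a tag end
                } else if (c == '>') {
                    inTag = false;
                    if (closing) depth--;
                    else if (!neutral && prev != '/') depth++;   // "/>" opens and closes at once
                    if (depth == 0 && !neutral) {
                        complete.add(pending.substring(0, scanned).trim());
                        pending.delete(0, scanned);
                        scanned = 0;
                    }
                }
                atTagStart = false;
            }
            prev = c;
        }
        return complete;
    }
}
```

Each returned string is a complete element, so it can be handed to an ordinary blocking parser without that parser ever waiting for more input.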

Edema answered 31/7, 2009 at 20:4 Comment(3)
This is the closest I've seen, unfortunately they don't offer it as a separate library and they don't translate into XML objects (Elements, Attributes etc.)Surfperch
XMLLightweightParser can easily be used standalone. It just ensures that you have complete XML tags. Openfire then feeds these chunks into XPP3 to parse a complete Element object. I am using the same thing in an XMPP-inspired server and it works like a charm.Edema
That still requires reading chunks longer than are strictly necessary, given that xpp3 uses blocking io (like most other pull parsers). That can be better than fully blocking, esp. if you explicitly frame pieces, but it's not quite ideal.Wilmott
F
1

Adding another answer as this question remains high in relevant Google searches: aalto-xml 0.9.7 (March 2011) has asynchronous XML parsing. This allows you to pass arbitrarily sized chunks of a document to continue parsing, and adds a new StAX event type, EVENT_INCOMPLETE, to indicate that the input buffer is exhausted and the document remains incomplete.

This is Tatu Saloranta's (the author's) example:

    // Imports assumed for aalto-xml 0.9.7 (package names may differ in other versions):
    // import com.fasterxml.aalto.AsyncInputFeeder;
    // import com.fasterxml.aalto.AsyncXMLStreamReader;
    // import com.fasterxml.aalto.stax.InputFactoryImpl;

    byte[] msg = "<html>Very <b>simple</b> input document!</html>".getBytes();
    AsyncXMLStreamReader asyncReader = new InputFactoryImpl().createAsyncXMLStreamReader();
    final AsyncInputFeeder feeder = asyncReader.getInputFeeder();
    int inputPtr = 0; // as we feed byte at a time
    int type = 0;

    do {
        // May need to feed multiple "segments"
        while ((type = asyncReader.next()) == AsyncXMLStreamReader.EVENT_INCOMPLETE) {
            feeder.feedInput(msg, inputPtr++, 1);
            if (inputPtr >= msg.length) { // to indicate end-of-content (important for error handling)
                feeder.endOfInput();
            }
        }
        // and once we have full event, we just dump out event type (for now)
        System.out.println("Got event of type: " + type);
        // could also just copy event as is, using Stax, or do any other normal non-blocking handling:
        // xmlStreamWriter.copyEventFromReader(asyncReader, false);
    } while (type != AsyncXMLStreamReader.END_DOCUMENT);
Forgotten answered 21/6, 2012 at 9:47 Comment(0)
G
1

NioSax works with ByteBuffers

http://blog.retep.org/2010/06/25/niosax-sax-style-xml-parser-for-java-nio/

The source code for the latest version I could find (10.6 from 2010) is in the Sonatype Maven repository:

https://oss.sonatype.org/content/repositories/releases/uk/org/retep/

Goatsbeard answered 9/6, 2013 at 19:51 Comment(0)
Y
0

I'm sorry, I didn't manage to solve this problem. I could not find a parser like the one I need. But I'm thinking of writing one myself. A very simple one: just a feasibility study, but enough to solve my problem and hopefully yours. Unfortunately I have been very busy and I'll be out for the next two weeks, but maybe in July I'll start working on it. I'll let you know as soon as I have something working.

mt

Yasui answered 25/6, 2009 at 23:34 Comment(2)
If you do let me know. If you can make it open source I will hopefully be able to contribute.Surfperch
I don't know, but maybe the Yielder framework could help with it? See chaoticjava.com/posts/category/code/java/frameworks/yielder for the details. The inner workings are ugly though, as it uses bytecode re-engineering to support the early-return-continue-later format.Except
