Java (JAXP) XML parsing differences of DocumentBuilder

Is there any kind of difference between

  1. DocumentBuilder.parse(InputStream) and
  2. DocumentBuilder.parse(InputSource) ?

All I could find is that in the first case the parser detects the encoding from the stream, so it seems safer, while in the second case I am not sure whether I am required to set the encoding myself.

Are there any other points (e.g. performance) I should be aware of?

Debi answered 23/11, 2010 at 7:0 Comment(0)

The main difference is that the first one lets you read XML content only from binary sources, that is, from subclasses of the InputStream abstract class: for example, directly from a file (using a FileInputStream) or from an open Socket (via Socket.getInputStream()).

The second one, DocumentBuilder.parse(InputSource), lets you read data from binary sources too (that is, an InputStream implementation) and also from character sources (Reader implementations). So with this one you can parse an XML String (using a StringReader) or read from a BufferedReader.
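As a rough illustration of the two kinds of source described above, here is a minimal sketch. The class name ParseSourcesSketch, its helper methods, and the sample XML are made up for this example; only the JAXP and SAX calls come from the standard library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
final class ParseSourcesSketch {

    // Binary source: the parser receives raw bytes and has to work out the encoding itself.
    static Document fromBytes(DocumentBuilder builder, byte[] xmlBytes)
            throws SAXException, IOException {
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            return builder.parse(in);
        }
    }

    // Character source: the String is already decoded, so it is wrapped in a Reader.
    static Document fromString(DocumentBuilder builder, String xml)
            throws SAXException, IOException {
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        String xml = "<greeting>hello</greeting>";

        Document viaStream = fromBytes(builder, xml.getBytes(StandardCharsets.UTF_8));
        Document viaReader = fromString(builder, xml);

        // Both documents should carry the same text content.
        System.out.println(viaStream.getDocumentElement().getTextContent());
        System.out.println(viaReader.getDocumentElement().getTextContent());
    }
}
```

If the XML is already in a String, the StringReader route avoids a round trip through bytes, so no charset has to be chosen at all.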

While the second method already lets you handle InputStreams, the first one is a shortcut: when you have an InputStream, you don't need to create a new InputSource yourself. In fact, the source code of the InputStream method is:

public Document parse(InputStream is)
    throws SAXException, IOException {
    if (is == null) {
        throw new IllegalArgumentException("InputStream cannot be null");
    }

    InputSource in = new InputSource(is);
    return parse(in);
}
Fib answered 23/11, 2010 at 15:18 Comment(5)
I have corrected my post. You are right, I meant InputSource. I already have valid XML in a String, and I could not decide which approach is better, i.e. converting it to an InputStream or to an InputSource. I read in ibm.com/developerworks/xml/library/x-tipsaxis.html that with InputStream the character encoding is detected from the stream itself, while with InputSource you should set it, and that this could end up causing parsing problems if the encoding you set is not the encoding actually used in the string. I was wondering if there were additional subtleties I should be aware of. - Debi
In your case, I would use the InputSource, as you already have the XML serialized in a String. To set the encoding, I would use the "setEncoding" method of InputSource. - Fib
Strange. The code you posted as the source does not seem consistent with what the article (from my previous comment) says about encoding. If this is the implementation of parse(InputStream), then the article is incorrect about the encoding behaviour it describes. Or am I missing something? - Debi
The code I posted is the source code from the JDK 6 version. The article you posted is from 2002, so maybe the implementation has changed since then. - Fib
Anyway, the article makes a point I don't really get. To feed a String to an InputStream, you need the String's byte[]. For that, you can use getBytes(), which uses the VM default encoding, or getBytes(String charsetName), in which you name the charset used for the encoding. If you already know the charset, you can set it on the InputSource. Otherwise, for a byte stream the parser autodetects the encoding (from a byte order mark or the XML declaration), and for a character stream such as a StringReader no encoding is needed at all. So, nowadays, I think there is no practical difference in the charset handling of both approaches. - Fib
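The encoding points raised in these comments can be checked with a small sketch. It assumes the JDK's default parser; the class name EncodingSketch, its methods, and the sample document are invented for illustration. The same text is parsed three ways: as a plain byte stream, as a byte stream with setEncoding, and as a character stream.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class EncodingSketch {

    private static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Byte stream, no hint: the parser relies on the BOM or the XML declaration.
    static String viaStream(String xml, Charset charset) throws Exception {
        byte[] bytes = xml.getBytes(charset);
        return newBuilder().parse(new ByteArrayInputStream(bytes))
                .getDocumentElement().getTextContent();
    }

    // Byte stream plus an explicit encoding supplied through InputSource.setEncoding.
    static String viaSourceWithEncoding(String xml, Charset charset) throws Exception {
        InputSource source = new InputSource(new ByteArrayInputStream(xml.getBytes(charset)));
        source.setEncoding(charset.name());
        return newBuilder().parse(source).getDocumentElement().getTextContent();
    }

    // Character stream: the text is already decoded, so no charset is involved.
    static String viaReader(String xml) throws Exception {
        return newBuilder().parse(new InputSource(new StringReader(xml)))
                .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // \u00e9 is a non-ASCII character, so a wrong charset would corrupt it.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
        Charset latin1 = StandardCharsets.ISO_8859_1;

        System.out.println(viaStream(xml, latin1).equals("caf\u00e9"));
        System.out.println(viaSourceWithEncoding(xml, latin1).equals("caf\u00e9"));
        System.out.println(viaReader(xml).equals("caf\u00e9"));
    }
}
```

The thing to watch with the byte-based routes is that the charset passed to getBytes matches the one named in the XML declaration (or in setEncoding); the StringReader route sidesteps that question.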
