How to parse an XML file containing BOM?

I want to parse an XML file from URL using JDOM. But when trying this:

SAXBuilder builder = new SAXBuilder();
builder.build(aUrl);

I get this exception:

Invalid byte 1 of 1-byte UTF-8 sequence.

I thought this might be a BOM issue, so I checked the source and saw a BOM at the beginning of the file. I tried reading from the URL using aUrl.openStream() and removing the BOM with the Commons IO BOMInputStream, but to my surprise it didn't detect any BOM. I then tried reading from the stream, writing to a local file, and parsing the local file. I set the encoding to UTF-8 for both the InputStreamReader and the OutputStreamWriter, but when I opened the file it was full of garbage characters.

I thought the problem was with the encoding of the source URL. But when I open the URL in a browser, save the XML to a file, and run that file through the process described above, everything works fine.

I appreciate any help on the possible cause of this issue.

Anemone answered 12/12, 2011 at 21:13 Comment(4)
Can you upload the offending file somewhere? - Herod
Does SAXBuilder have a known bug with BOMs in UTF-8? XML parsers should handle them without error. Either way, from that description I'd be more inclined to suspect it's not UTF-8 at all. - Vic
@JonHanna I don't know about SAXBuilder. I couldn't find anything pointing to a problem with SAXBuilder. As for the second point, the file states that it's UTF-8 in its prolog. Also, when I try to view it in any other encoding, the BOM appears at the beginning. - Anemone
For the BOM problem, you can take a look here - Dosia

That HTTP server is sending the content in GZIPped form (Content-Encoding: gzip; see http://en.wikipedia.org/wiki/HTTP_compression if you don't know what that means), so you need to wrap aUrl.openStream() in a GZIPInputStream that will decompress it for you. For example:

builder.build(new GZIPInputStream(aUrl.openStream()));

Edited to add, based on the follow-up comment: If you don't know in advance whether the URL will be GZIPped, you can write something like this:

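import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

private InputStream openStream(final URL url) throws IOException
{
    final URLConnection cxn = url.openConnection();
    // Value of the Content-Encoding response header, or null if there is none
    final String contentEncoding = cxn.getContentEncoding();
    if(contentEncoding == null)
        return cxn.getInputStream(); // not compressed: return the raw stream
    else if(contentEncoding.equalsIgnoreCase("gzip")
               || contentEncoding.equalsIgnoreCase("x-gzip"))
        return new GZIPInputStream(cxn.getInputStream()); // decompress while reading
    else
        throw new IOException("Unexpected content-encoding: " + contentEncoding);
}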

(Warning: not tested.) Then use:

builder.build(openStream(aUrl));

This is basically equivalent to the above (aUrl.openStream() is explicitly documented to be a shorthand for aUrl.openConnection().getInputStream()), except that it examines the Content-Encoding header before deciding whether to wrap the stream in a GZIPInputStream.

See the documentation for java.net.URLConnection.
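
The gzip explanation would also account for BOMInputStream finding no BOM: a gzip stream starts with the two magic bytes 0x1F 0x8B, not with the UTF-8 BOM (0xEF 0xBB 0xBF). As a rough sketch for situations where the Content-Encoding header is not at hand (for example, when all you have is an InputStream), you could peek at those two bytes and wrap the stream only when they match. The names GzipSniffer and maybeGunzip are made up for this illustration, and the check is a heuristic, not a replacement for reading the header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;

public class GzipSniffer {
    /**
     * Hypothetical helper: returns a decompressing stream if the input
     * starts with the gzip magic number (0x1F 0x8B), otherwise a stream
     * that yields the original bytes unchanged.
     */
    public static InputStream maybeGunzip(InputStream in) throws IOException {
        // Pushback buffer of 2 bytes so the peeked header can be put back.
        PushbackInputStream pb = new PushbackInputStream(in, 2);
        byte[] header = new byte[2];

        // Read up to 2 bytes; a single read() may return fewer than asked.
        int n = 0;
        while (n < 2) {
            int r = pb.read(header, n, 2 - n);
            if (r < 0) {
                break; // end of stream before 2 bytes were available
            }
            n += r;
        }
        if (n > 0) {
            pb.unread(header, 0, n); // restore the bytes we peeked at
        }

        boolean looksGzipped = n == 2
                && (header[0] & 0xFF) == 0x1F
                && (header[1] & 0xFF) == 0x8B;
        return looksGzipped ? new GZIPInputStream(pb) : pb;
    }
}
```

With this in place the call from the question would become builder.build(GzipSniffer.maybeGunzip(aUrl.openStream())), which should work for both compressed and uncompressed responses.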

Stendhal answered 12/12, 2011 at 22:29 Comment(3)
Dude, thanks, that solved the problem. You have no idea how much you helped me. One question though: if I use GZIPInputStream to wrap any input stream, will that cause problems with streams that are not gzipped? - Anemone
I tested it and yes, it does cause problems. It throws an IOException if the stream is not in gzip format. - Anemone
@doctrey: You're welcome! Re: non-GZIPped streams: Yeah, that would be a problem, since GZIPInputStream requires that its input be GZIPped. I've edited my answer to give (untested) code to handle both cases. - Stendhal

You might find you can avoid handling encoded responses by sending a blank Accept-Encoding header. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html: "If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding." That seems to be what is happening here: the request carries no Accept-Encoding header, so the server feels free to gzip the response.
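
A minimal sketch of that idea, assuming the server honours the header (it is free not to): set Accept-Encoding on the URLConnection before reading from it. The value "identity" is used here in place of a blank value as an explicit way to ask for an unencoded response; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class IdentityRequest {
    /**
     * Hypothetical helper: prepares a connection that asks the server
     * not to apply any content coding such as gzip.
     */
    public static URLConnection openUncompressed(URL url) throws IOException {
        // openConnection() only creates the connection object; the request
        // is sent later, e.g. when getInputStream() is called.
        URLConnection cxn = url.openConnection();
        // Request headers must be set before the connection is made.
        cxn.setRequestProperty("Accept-Encoding", "identity");
        return cxn;
    }
}
```

The parse call would then be builder.build(IdentityRequest.openUncompressed(aUrl).getInputStream()). Since a server may ignore the header, checking Content-Encoding on the response is still the safer route.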

Lamellirostral answered 12/12, 2011 at 23:26 Comment(0)
