Why does org.apache.xerces.parsers.SAXParser not skip the BOM in UTF-8 encoded XML?

I have an XML file with UTF-8 encoding, and it contains a BOM at the beginning. During parsing I get org.xml.sax.SAXParseException: Content is not allowed in prolog. I cannot remove those 3 bytes from the files, and I cannot load a file into memory and strip them there because the files are big. For performance reasons I'm using a SAX parser, and I just want to skip those 3 bytes if they are present before the "" tag. Should I subclass InputStreamReader for this?

I'm new to Java, so please show me the right way.

Sassafras answered 18/3, 2011 at 14:58 Comment(1)
Possible duplicate of "Byte order mark screws up file reading in Java" (Chastise)

This has come up before, and I found the answer on Stack Overflow when it happened to me. The linked answer uses a PushbackInputStream to test for the BOM.

Rocca answered 18/3, 2011 at 15:15 Comment(0)

I've experienced the same problem and I've solved it with this code:

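// Requires: import java.io.*;
private static InputStream checkForUtf8BOM(InputStream inputStream) throws IOException {
    PushbackInputStream pushbackInputStream = new PushbackInputStream(new BufferedInputStream(inputStream), 3);
    byte[] bom = new byte[3];
    // A single read() may return fewer than 3 bytes, so loop until 3 bytes or end of stream.
    int count = 0;
    while (count < bom.length) {
        int n = pushbackInputStream.read(bom, count, bom.length - count);
        if (n == -1) {
            break;
        }
        count += n;
    }
    boolean hasBom = count == 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF;
    // Push back only the bytes actually read, and only when they are not a BOM.
    if (!hasBom && count > 0) {
        pushbackInputStream.unread(bom, 0, count);
    }
    return pushbackInputStream;
}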
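To connect this back to the question, here is a minimal sketch of handing the BOM-free stream to a SAX parser. It uses the JDK's built-in javax.xml.parsers.SAXParserFactory rather than org.apache.xerces.parsers.SAXParser directly, and the names BomSkippingSaxDemo, skipUtf8Bom, countElements, and sampleWithBom are made up for illustration. The file is never loaded into memory; only the first three bytes are inspected.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class BomSkippingSaxDemo {

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int count = 0;
        // Loop because one read() call may deliver fewer than 3 bytes.
        while (count < head.length) {
            int n = pushback.read(head, count, head.length - count);
            if (n == -1) {
                break;
            }
            count += n;
        }
        boolean hasBom = count == 3
                && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF;
        if (!hasBom && count > 0) {
            pushback.unread(head, 0, count); // not a BOM: give the bytes back
        }
        return pushback;
    }

    /** Parses the stream with the JDK's default SAX parser and counts start tags. */
    static int countElements(InputStream xml) throws Exception {
        final int[] counter = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                counter[0]++;
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(skipUtf8Bom(xml), handler);
        return counter[0];
    }

    /** Builds a small in-memory document prefixed with a UTF-8 BOM (illustration only). */
    static InputStream sampleWithBom() {
        byte[] body = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><a/><a/></root>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] data = new byte[body.length + 3];
        data[0] = (byte) 0xEF;
        data[1] = (byte) 0xBB;
        data[2] = (byte) 0xBF;
        System.arraycopy(body, 0, data, 3, body.length);
        return new ByteArrayInputStream(data);
    }
}
```

For a real file, something like countElements(new BufferedInputStream(new FileInputStream(path))) should work the same way, with your own handler in place of the counting one.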
Judicature answered 18/3, 2011 at 15:17 Comment(2)
This is for UTF-8. I assume UTF-16 would differ (I believe it's only 2 bytes)? (Seise)
Sorry for the late reply. Yes, the UTF-16 BOM is only two bytes: 0xFE 0xFF (big-endian) or 0xFF 0xFE (little-endian). (Judicature)
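Building on the byte values mentioned in these comments, the same pushback idea can be extended to UTF-16. The following is only a sketch: BomSniffer and skipBom are hypothetical names, UTF-32 is deliberately ignored (so a UTF-32LE stream, which also begins with 0xFF 0xFE, would be misreported), and the caller is assumed to create the PushbackInputStream with a pushback buffer of at least 3 bytes.

```java
import java.io.IOException;
import java.io.PushbackInputStream;

final class BomSniffer {

    /**
     * Skips a leading UTF-8 or UTF-16 BOM and returns the charset name it implies,
     * or null when no BOM is found (in which case nothing is consumed).
     */
    static String skipBom(PushbackInputStream in) throws IOException {
        byte[] head = new byte[3];
        int count = 0;
        // Loop because one read() call may deliver fewer than 3 bytes.
        while (count < head.length) {
            int n = in.read(head, count, head.length - count);
            if (n == -1) {
                break;
            }
            count += n;
        }

        String charset = null;
        int bomLength = 0;
        if (count == 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF) {
            charset = "UTF-8";
            bomLength = 3;
        } else if (count >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            charset = "UTF-16BE";
            bomLength = 2;
        } else if (count >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            charset = "UTF-16LE";
            bomLength = 2;
        }

        // Return everything that was read beyond the BOM to the stream.
        if (count > bomLength) {
            in.unread(head, bomLength, count - bomLength);
        }
        return charset;
    }
}
```

The returned name could then be passed to an InputStreamReader, falling back to whatever default you consider appropriate when the result is null.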
// Once a Reader has decoded the bytes with the right charset, a BOM in any
// Unicode encoding shows up as the single character U+FEFF, so one check is enough.
// Byte patterns such as 0xEF 0xBB 0xBF exist only before decoding and never
// appear as chars like 0xEFBB.
private static final char BOM = '\uFEFF';

// Consumes a leading BOM if present and reports whether one was found.
// The reader must support mark()/reset(), as BufferedReader does.
private static boolean removeBOM(Reader reader) throws IOException {
    reader.mark(1);
    int first = reader.read();
    if (first == BOM) {
        return true;
    }
    reader.reset(); // not a BOM (or empty input): rewind to the start
    return false;
}

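Usage:

// XML can be read from a file, URL, or string through a stream.
URL url = new URL("some xml url");
// Name the charset explicitly (java.nio.charset.StandardCharsets); otherwise the
// platform default decodes the bytes and the BOM may not be recognized.
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
removeBOM(bufferedReader);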
Eadwina answered 6/5, 2013 at 8:40 Comment(0)
