Is there a good XML parser for light-scanning an XML file to get the byte offsets of elements?
Asked Answered
J

2

7

We have a system where we're processing XML files where the file itself is too large to fit in memory.

As part of processing, we want to quickly scan through to record the offset of relevant elements, so that later on, we can seek immediately to those elements and parse just the piece we want (since the smaller slice of the file would fit in memory, we can afford to use a DOM or whatever for that part.)

Obviously we could just write our own XML parser from scratch, but before making yet another XML parser, I wanted to see if there were any other options available.

What follows is a list of the things we already know about.

  1. Using StAX should work, but doesn't. Here's a demonstration of that. I made an XML example where there are characters longer than one byte to demonstrate that the returned byte offset is not correct once you start passing these characters. Note that even though the method in the API is called getCharacterOffset(), the documentation says that it returns the byte offset if you passed in a byte stream - which is what this code is doing.

    @Test
    public void testByteOffsetsFromStreamParser() throws Exception {
        // byte counts are size required for UTF-8, I checked using Ishida's tool.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                     "<root>\n"
                     " <leaf>\u305A\u308C\u306A\u3044\u3067\u307B\u3057\u3044</leaf>\n" +
                     " <leaf>\u305A\u308C\u306A\u3044\u3067\u307B\u3057\u3044</leaf>\n" +
                     " <leaf>\u305A\u308C\u306A\u3044\u3067\u307B\u3057\u3044</leaf>\n" +
                     "</root>\n";
        byte[] xmlBytes = xml.getBytes("UTF-8");
        assertThat(xmlBytes.length, is(equalTo(171)));  // = 171 from above
    
        String implToTest = "com.sun.xml.internal.stream.XMLInputFactoryImpl";
        //String implToTest = "com.ctc.wstx.stax.WstxInputFactory";
        XMLInputFactory factory =
            Class.forName(implToTest).asSubclass(XMLInputFactory.class).newInstance();
        factory.setProperty("javax.xml.stream.isCoalescing", false);
        factory.setProperty("javax.xml.stream.supportDTD", false);
        XMLEventReader reader = factory.createXMLEventReader(
            new ByteArrayInputStream(xmlBytes));
        try {
            XMLEvent event;
    
            event = reader.nextTag(); // <root>
            checkByteOffset(event, 39);
    
            event = reader.nextTag(); // <leaf>
            checkByteOffset(event, 47);
    
            event = reader.nextEvent(); // (text)
            checkByteOffset(event, 53);
    
            event = reader.nextTag(); // </leaf>
            checkByteOffset(event, 77);
    
            event = reader.nextTag(); // <leaf>
            checkByteOffset(event, 86);
    
            event = reader.nextEvent(); // (text)
            checkByteOffset(event, 92);
    
            event = reader.nextTag(); // </leaf>
            checkByteOffset(event, 116);
    
            event = reader.nextTag(); // <leaf>
            checkByteOffset(event, 125);
    
            event = reader.nextEvent(); // (text)
            checkByteOffset(event, 131);
    
            event = reader.nextTag(); // </leaf>
            checkByteOffset(event, 155);
    
            event = reader.nextTag(); // </root>
            checkByteOffset(event, 163);
        } finally {
            reader.close(); // no auto-close :(
        }
    }
    
    private void checkByteOffset(XMLEvent event, int expectedOffset) {
        System.out.println("Expected Offset: " + expectedOffset +
            "    - Actual Offset: " + event.getLocation().getCharacterOffset());
    }
    

    Results for the factory which you get by default in Java 7:

    Expected Offset: 39    - Actual Offset: 45
    Expected Offset: 47    - Actual Offset: 53
    Expected Offset: 53    - Actual Offset: 63
    Expected Offset: 77    - Actual Offset: 68
    Expected Offset: 86    - Actual Offset: 76
    Expected Offset: 92    - Actual Offset: 86
    Expected Offset: 116    - Actual Offset: 91
    Expected Offset: 125    - Actual Offset: 99
    Expected Offset: 131    - Actual Offset: 109
    Expected Offset: 155    - Actual Offset: 114
    Expected Offset: 163    - Actual Offset: 122
    

    Results for Woodstox, which we tried based on some other stackoverflow post suggestion. Note that although it starts out being correct, after a few lines, it's even more incorrect than the default parser:

    Expected Offset: 39    - Actual Offset: 39
    Expected Offset: 47    - Actual Offset: 47
    Expected Offset: 53    - Actual Offset: 53
    Expected Offset: 77    - Actual Offset: 61
    Expected Offset: 86    - Actual Offset: 70
    Expected Offset: 92    - Actual Offset: 76
    Expected Offset: 116    - Actual Offset: 84
    Expected Offset: 125    - Actual Offset: 93
    Expected Offset: 131    - Actual Offset: 99
    Expected Offset: 155    - Actual Offset: 107
    Expected Offset: 163    - Actual Offset: 115
    
  2. We're aware of a library called VTD-XML which does almost exactly what we're after, but it has two problems. The first problem is that it reads the whole file into memory and the file itself won't fit. The second problem is that the licence is GPL and not compatible with the rest of our stuff.

Jerrodjerrol answered 9/1, 2014 at 1:22 Comment(11)
For what it's worth, I would assume that most Java XML tools will report offsets in UTF-16, not in bytes -- because they're letting the I/O library deal with the encoding and seeing it only as Java-native UTF-16 characters. I suppose you could kluge by pre-converting the document to UTF16 so the UTF-16 offset is exactly 2x the byte offset, but for ASCII documents you'd double their size and you're already saying this is too large.Ezmeralda
Cool requirement and well explained. +1Touristy
Actually... if we did have the ability to normalise, converting to UTF-16 is exactly what I would do. Then we could use the StAX solution, as the offsets would match. The catch here is that we don't have a convenient place (yet!) to store this sort of derived input file. And we have a mess of a system where one process will parse the file the first time, but the guy coming in to fetch the element might be a different Java process. :)Jerrodjerrol
extended vtd-xml do not load everything in memory... it uses memory mapping... FYIEric
Hmm... so now VTD-XML might work now as long as you are OK with waiting for your XML to copy into a file so that you can memory map it.Jerrodjerrol
Hi, I recently answered #43367066 which seems to work.Schistosome
As a degraded mode, I once had a XML pipeline of the same sort, but I had confidence over the file and its namespace declarations. So I wrote a simple wrapper over InputStream that would look for the offset of the sequence I was interested in. Very lightweight, very fast. But not XML compliant - therefore there would have been ways to break it with a perfectly legal XML file. I just knew for sure that would not happen.Rivulet
@Trejkaz what happens if you feed the XMLInputFactory with a Reader, and not an InputStream, does it behave reliably (in terms of character counts) ?Rivulet
@Rivulet That's a good question. It's been too long to really remember. It probably did give the right character counts, but knowing the character count doesn't really help with the goal of seeking immediately to the right byte in the file, so I imagine we just couldn't use it anyway.Jerrodjerrol
@Trejkaz : sorry, the question popped in my feed, I did not see it was that old. Anyway, you can Reader#skip(long n) if you know how many chars to skip. It sure is slower than skipping bytes from a file (O(n) vs. 0(1)), but depending on what you want to do, maybe fast enough (e.g. if you batch process the whole file, the O(n) is amortized).Rivulet
@Rivulet the rough use case was to process only one element in a multi-gigabyte XML file without performing unnecessary reads from potentially-slow disk. We had found that reading the data without parsing it (required in order to skip it) was just as slow as parsing it, so we were looking at cunning ways to record an offset into the file so that we can just seek back to the right element to process.Jerrodjerrol
S
2

Some time ago I created this approach for fun. Maybe it will help you. It basically does the following.

  1. Create a self generated XML parser with ANTLR
  2. Hook into the parsing routine to emit byte offsets
  3. Use random access to stream from each byte offset into a prepared POJO using Jackson.

For complete example look into Using StAX to create index for XML for quick access

Schistosome answered 4/10, 2018 at 7:40 Comment(3)
This is very cool, if a bit heavy on libraries :-). Is the ANTLR XML grammar 100% trustworthy, W3C maintained, ... (I have just skimmed the surface) ?Rivulet
@Rivulet I'm not sure about the status of the ANTLR XML grammar. Maybe ask the original maintainer on github?Schistosome
With ANTLR, since it can't just parse binary, you'd essentially be using ISO-8859-1 and then taking the strings it gives you, converting back to bytes and then applying the real encoding to get back the right string. I think if we were going to make our own parser, though, this is definitely the way we'd have to do it, unless there were something exactly like ANTLR which didn't convert to chars up-front.Jerrodjerrol
E
1

Possible approach:

1) Open the file as a byte stream.

2) Wrap an input stream/reader around that which (a) converts from UTF-8 to UTF-16, but (b) in the process, tracks which Java characters are basic ASCII range and which are 2-byte UTF16. (I can think of several ways to keep the memory requirements of that tracking down to something reasonable.)

3) When you need a file offset, use that tracking table to back-convert from Java UTF-16 character count to byte count.

Can't think of any reason why it wouldn't work...

Ezmeralda answered 9/1, 2014 at 1:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.