Java XML Parsing and original byte offsets

I'd like to parse some well-formed XML into a DOM, but I'd also like to know the offset of each node's tag in the original media.

For example, if I had an XML document with the content something like:

<html>
<body>
<div>text</div>
</body>
</html>

I'd like to know that the <div> node starts at offset 13 in the original media, and (more importantly) that "text" starts at offset 18.

Is this possible with standard Java XML parsers? JAXB? If no solution is easily available, what type of changes are necessary along the parsing path to make this possible?

Lexington answered 17/8, 2010 at 22:5 Comment(2)
Take a look at this question, stackoverflow.com/questions/43366566, to find character offsets in large XML files and how to use them with JAXB. - Reveille
See also "JAXB location in file for unmarshalled objects". - Montcalm

The SAX API provides a rather obscure mechanism for this: the org.xml.sax.Locator interface. When you use the SAX API, you subclass DefaultHandler and pass it to one of the SAX parse methods, and the parser implementation is supposed to hand your handler a Locator via setDocumentLocator(). As parsing proceeds, the callback methods on your ContentHandler are invoked (e.g. startElement()), at which point you can consult the Locator to find the current parsing position via getLineNumber() and getColumnNumber(). Note that this gives you a line and column rather than a raw offset.

Technically, this is optional functionality, but the Javadoc says that implementations are "strongly encouraged" to provide it, so you can likely assume the SAX parser built into Java SE will do it.
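A minimal sketch of the Locator approach, assuming the default JDK SAX parser and an in-memory document. The class name LocatorDemo, the PositionHandler handler, and the locate() helper are made up for illustration. Keep in mind that the SAX documentation describes the reported position as where the current event ends and calls it an approximation, so treat the column in particular as a hint, not an exact tag start.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class LocatorDemo {

    /** Hypothetical handler: records "name line=L column=C" for each start tag. */
    static class PositionHandler extends DefaultHandler {
        final List<String> positions = new ArrayList<>();
        private Locator locator; // stays null if the parser supplies none

        @Override
        public void setDocumentLocator(Locator locator) {
            // The parser calls this before any other document event.
            this.locator = locator;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (locator != null) {
                positions.add(qName + " line=" + locator.getLineNumber()
                        + " column=" + locator.getColumnNumber());
            }
        }
    }

    /** Parses the given XML text and returns the recorded positions. */
    static List<String> locate(String xml) throws Exception {
        PositionHandler handler = new PositionHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.positions;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html>\n<body>\n<div>text</div>\n</body>\n</html>";
        locate(xml).forEach(System.out::println);
    }
}
```

Turning a line/column pair into an offset would still need a separate pass over the original text to find where each line begins.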

Of course, this does mean using the SAX API, which is no one's idea of fun, but I can't see a way of accessing this information using a higher-level API.

edit: Found this example.

Neptunian answered 17/8, 2010 at 22:24 Comment(0)

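Use XMLStreamReader and its getLocation() method, which returns a Location object. location.getCharacterOffset() gives the offset of the current location. According to the Location Javadoc, this is a byte offset when the input is a file or byte stream and a character offset when the input is character-based, such as the FileReader used here. Implementations may differ, so check the values against a file with multi-byte characters before relying on them.

import java.io.FileReader;
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class Runner {

    public static void main(String[] argv) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // try-with-resources closes the file; XMLStreamReader.close() does not
        try (FileReader input = new FileReader("D:\\BigFile.xml")) {
            XMLStreamReader streamReader = factory.createXMLStreamReader(input);

            while (streamReader.hasNext()) {
                // next() advances the parser and returns the new event type
                if (streamReader.next() == XMLStreamConstants.START_ELEMENT) {
                    Location location = streamReader.getLocation();
                    // The input is a Reader, so this is a character offset
                    System.out.println("offset: " + location.getCharacterOffset());
                }
            }
            streamReader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}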
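Since the code above reads through a Reader, the reported values are character positions. If byte offsets are what you need, one possible workaround is to encode the prefix of the text up to that position and count the bytes. The sketch below assumes the whole document fits in memory as a String, that it is stored as UTF-8 without a byte order mark, and that an offset never falls inside a surrogate pair. The names OffsetDemo, toByteOffset and startElementOffsets are made up for illustration.

```java
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class OffsetDemo {

    /** Bytes occupied by the first charOffset chars of text in the given encoding. */
    static int toByteOffset(String text, int charOffset, Charset encoding) {
        return text.substring(0, charOffset).getBytes(encoding).length;
    }

    /** Offsets the parser reports at each START_ELEMENT event. */
    static List<Integer> startElementOffsets(String xml) throws Exception {
        List<Integer> offsets = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                offsets.add(reader.getLocation().getCharacterOffset());
            }
        }
        reader.close();
        return offsets;
    }

    public static void main(String[] args) throws Exception {
        // \u00e9 takes one char in a String but two bytes in UTF-8
        String xml = "<doc><name>\u00e9</name><item/></doc>";
        for (int charOffset : startElementOffsets(xml)) {
            // Skip values the parser reports as unknown or out of range
            if (charOffset < 0 || charOffset > xml.length()) {
                continue;
            }
            System.out.println("char offset " + charOffset + " -> byte offset "
                    + toByteOffset(xml, charOffset, StandardCharsets.UTF_8));
        }
    }
}
```

Whether the reported position is the start or the end of the tag depends on the parser, so compare the printed values against a small known document first.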
Lading answered 30/10, 2014 at 12:8 Comment(0)
