XML / Java: Precise line and character positions whilst parsing tags and attributes?

Asked 31/1, 2017 at 22:2 Answered 19/2, 2017 at 9:20

I’m trying to find a way to precisely determine the line number and character position of both tags and attributes whilst parsing an XML document. I want to do this so that I can report accurately to the author of the XML document (via a web interface) where the document is invalid.

Ultimately I want to set the caret in a text area at the invalid tag, or just inside the open quote of the invalid attribute. (I’m not using XML Schema at this point because the exact format of the attributes matters in a way that cannot be validated by schema alone. I may even want to report some attributes as being invalid part-way through the attribute’s value. Or similarly, part-way through the text between a start and end tag.)

I’ve tried using SAX (org.xml.sax) and the Locator interface. This works up to a point but isn’t nearly good enough. It will only report the read position after an event; for example, the character immediately after an open tag ends, for startElement(). I can’t just subtract back the length of the tag name because attributes, self-closing tags and/or newlines within the open tag will throw this out. (And Locator provides no information about the position of attributes at all.)

Ideally I was looking to use an event-based approach, as I already have a SAX handler that builds an in-house DOM-like representation for further processing. However, I would be interested in knowing about any DOM or DOM-like library that includes exact position information for the model’s elements.

Has anyone solved this issue, or any like it, with the required level of precision?

Hunger answered 31/1, 2017 at 22:2 Comment(7)

An event-based approach? Like XMLEventReader, and the XMLEvent.getLocation method? – Gisellegish 1/2, 2017 at 14:51

I've tried using XMLStreamReader rather than XMLEventReader. However, the positions it reports are the end position of each event. So, for example, after a START_ELEMENT the position indicated is immediately after the close of the start tag (note - start tag, not element). There appears to be no reliable way to determine the position of the start of the tag. Also, I never get any ATTRIBUTE events at all, as these are coalesced into a single START_ELEMENT event, so I can't get any further accuracy on the attributes' positions either. – Hunger 15/2, 2017 at 22:37
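
For anyone wanting to reproduce the behaviour described in this comment, here is a minimal sketch using the standard javax.xml.stream API. The class name StaxLocationDemo and its method are made up for illustration. The exact line and column values are implementation-dependent, so the sketch only prints what the parser reports; it does not claim any particular numbers.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: lists what StAX reports for each START_ELEMENT.
public final class StaxLocationDemo {
    private StaxLocationDemo() {}

    public static List<String> reportStartElements(String xml) {
        List<String> out = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // One location per event; attributes arrive with the
                        // element and have no location of their own.
                        Location loc = reader.getLocation();
                        out.add(reader.getLocalName()
                                + " line=" + loc.getLineNumber()
                                + " column=" + loc.getColumnNumber()
                                + " attrs=" + reader.getAttributeCount());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<root>\n  <section coords=\"0,0\"/>\n</root>";
        reportStartElements(xml).forEach(System.out::println);
    }
}
```

Whatever values are printed, there is a single Location per START_ELEMENT, which is why this API alone cannot give per-attribute positions.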

Please explain what you mean when you say you're not using XML Schema at this point because the exact format of the attributes matters in a way that cannot be validated by schema alone. – Tace 19/2, 2017 at 4:8

Re "Please explain..." Some of the attribute values will be 'micro-languages' that need to be parsed and checked. For example, coords="0,0; 10,0; 10,10; 0;10". If I can determine the exact (line, char) position of the first quote then it will be easy to additionally parse the values of the attribute and indicate exactly where any errors occur. – Hunger 21/2, 2017 at 17:4
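
To illustrate the idea in this comment: once the column of the opening quote is known, an error inside the value maps back to the document by simple addition. The sketch below is only an illustration. The class CoordsChecker and the grammar it checks (unsigned integer pairs "x,y" separated by ";" with optional spaces) are assumptions, not the asker's actual format, and the column arithmetic assumes the value sits on one line and contains no entity references.

```java
// Hypothetical checker for an assumed "x,y; x,y; ..." micro-language.
public final class CoordsChecker {
    private CoordsChecker() {}

    /** Returns the 0-based offset of the first problem in value, or -1 if it is valid. */
    public static int firstErrorOffset(String value) {
        final int n = value.length();
        int i = 0;
        while (true) {
            i = skipSpaces(value, i);
            int afterX = skipDigits(value, i);
            if (afterX == i) return i;                      // expected digits for x
            i = afterX;
            if (i >= n || value.charAt(i) != ',') return i; // expected ','
            i++;
            int afterY = skipDigits(value, i);
            if (afterY == i) return i;                      // expected digits for y
            i = skipSpaces(value, afterY);
            if (i == n) return -1;                          // clean end of value
            if (value.charAt(i) != ';') return i;           // expected ';'
            i++;
        }
    }

    /** Maps an offset inside the value to a 1-based document column. */
    public static int documentColumn(int quoteColumn, int offsetInValue) {
        return quoteColumn + 1 + offsetInValue;
    }

    private static int skipSpaces(String s, int i) {
        while (i < s.length() && s.charAt(i) == ' ') i++;
        return i;
    }

    private static int skipDigits(String s, int i) {
        while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
        return i;
    }
}
```

Under that assumed grammar, the value from the comment is rejected at the ";" in "0;10", which is offset 19 within the value.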

@Paul, I am having the exact same frustration as you; I would like an XML parser that gives me the start and end position of each element, attribute and text section so that I can write a syntax highlighter, and I cannot find anything off-the-shelf that does that in Java. Did you ever find a solution, or did you write your own lexer? – Viipuri 28/6, 2018 at 23:29

@Eric For my own purposes I've switched to using a home-brewed form of wiki markup, though I may need to look at the XML-based approach again in future. I did find an XMLScanner class (in Batik) that may work, but I never got around to trying it. xmlgraphics.apache.org/batik/javadoc/org/apache/batik/xml/… – Hunger 12/7, 2018 at 14:27

@Paul: Thanks; I ended up using the DOMParser, which gives you barely enough information. It gives you the location of the last character that the parser looked at. Since < is illegal inside an element, you can then look backwards from that position for the start of the element, and then lex it from there. I don't know why they didn't simply put the character location in the element when it was parsed! – Viipuri 12/7, 2018 at 15:31
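
The look-backwards trick described in this comment can be sketched roughly as follows. This is not Viipuri's code. The class TagStartFinder is hypothetical, and it assumes the whole document is available as a String, that lines end with '\n' only, and that the parser's reported (line, column) is 1-based and points just past the '>' of the start tag.

```java
// Hypothetical helper: find where a start tag begins, given where it ended.
public final class TagStartFinder {
    private TagStartFinder() {}

    /** Converts a 1-based (line, column) pair to a 0-based character offset. */
    public static int toOffset(String text, int line, int column) {
        int offset = 0;
        for (int l = 1; l < line; l++) {
            int nl = text.indexOf('\n', offset);
            if (nl < 0) throw new IllegalArgumentException("line out of range: " + line);
            offset = nl + 1;
        }
        return offset + column - 1;
    }

    /**
     * Returns the offset of the nearest '<' strictly before endOffset, or -1.
     * This is safe inside a start tag because a literal '<' may not appear
     * in attribute values.
     */
    public static int findTagStart(String text, int endOffset) {
        return text.lastIndexOf('<', endOffset - 1);
    }

    /** Converts a 0-based offset back to a 1-based {line, column} pair. */
    public static int[] toLineColumn(String text, int offset) {
        int line = 1;
        int lineStart = 0;
        for (int i = 0; i < offset; i++) {
            if (text.charAt(i) == '\n') {
                line++;
                lineStart = i + 1;
            }
        }
        return new int[] { line, offset - lineStart + 1 };
    }
}
```

From the offset returned by findTagStart, the tag can then be lexed forwards to locate each attribute name and quote.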

XML parsers will (and should) smooth over certain things like additional whitespace, so exact mapping back to the character stream is not feasible.

You should rather look into getting a lexer or 'token stream generator' for increased detail, in other words go to the detail level below XML parsers.

There are a few general frameworks for writing lexers in Java. This ANTLR 3-based page has a nice overview of lexer vs parser and, in section one, some rudimentary XML lexer examples.
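
As a rough idea of what working at the token level looks like, here is a minimal hand-written sketch with no framework. It is not taken from the ANTLR page. The class PositionLexer and its token kinds are invented for illustration. It records the 1-based line and column of each tag's '<', each attribute name, and each attribute value's opening quote. It deliberately skips anything starting with "<?" or "<!" up to the next '>', so comments or CDATA sections that contain '>' are not handled, and it does no well-formedness checking.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical position-tracking lexer for start tags, end tags and attributes.
public final class PositionLexer {
    public static final class Token {
        public final String kind;   // START_TAG, END_TAG, ATTR_NAME or ATTR_VALUE
        public final String text;
        public final int line;
        public final int column;
        Token(String kind, String text, int line, int column) {
            this.kind = kind; this.text = text; this.line = line; this.column = column;
        }
    }

    private final String src;
    private final List<Token> tokens = new ArrayList<>();
    private int pos = 0, line = 1, col = 1;

    private PositionLexer(String src) { this.src = src; }

    public static List<Token> lex(String src) {
        PositionLexer lexer = new PositionLexer(src);
        while (lexer.pos < src.length()) {
            if (src.charAt(lexer.pos) == '<') lexer.tag(); else lexer.advance();
        }
        return lexer.tokens;
    }

    // Moves one character forward, keeping line and column in step.
    private void advance() {
        if (src.charAt(pos) == '\n') { line++; col = 1; } else { col++; }
        pos++;
    }

    private boolean at(char ch) { return pos < src.length() && src.charAt(pos) == ch; }

    private void tag() {
        int tagLine = line, tagCol = col;          // position of '<'
        advance();
        if (at('?') || at('!')) {                  // declaration, PI, comment: skip
            while (pos < src.length() && !at('>')) advance();
            if (pos < src.length()) advance();
            return;
        }
        boolean end = at('/');
        if (end) advance();
        tokens.add(new Token(end ? "END_TAG" : "START_TAG", name(), tagLine, tagCol));
        while (pos < src.length() && !at('>')) {
            char ch = src.charAt(pos);
            if (ch == '"' || ch == '\'') {
                value(ch);
            } else if (isNameChar(ch)) {
                int nameLine = line, nameCol = col;
                tokens.add(new Token("ATTR_NAME", name(), nameLine, nameCol));
            } else {
                advance();                         // whitespace, '=' or '/'
            }
        }
        if (pos < src.length()) advance();         // consume '>'
    }

    private String name() {
        int start = pos;
        while (pos < src.length() && isNameChar(src.charAt(pos))) advance();
        return src.substring(start, pos);
    }

    private void value(char quote) {
        int quoteLine = line, quoteCol = col;      // position of the opening quote
        advance();
        int start = pos;
        while (pos < src.length() && !at(quote)) advance();
        tokens.add(new Token("ATTR_VALUE", src.substring(start, pos), quoteLine, quoteCol));
        if (pos < src.length()) advance();         // consume the closing quote
    }

    private static boolean isNameChar(char ch) {
        return Character.isLetterOrDigit(ch) || ch == ':' || ch == '_' || ch == '-' || ch == '.';
    }
}
```

Because every character passes through advance(), positions stay exact even when a start tag spans several lines, which is the case the parser-level APIs cannot cover.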

I'd also like to comment that for a user with a web interface, maybe you should consider a pure client-side (i.e. JavaScript) solution.

Overglaze answered 19/2, 2017 at 9:20 Comment(3)

Thanks. I've used ANTLR before but I'm not a fan. I am coming around to the idea that I might have to write a lexer myself. – Hunger 21/2, 2017 at 17:7

An interactive JavaScript interface is a good longer-term idea. Right now though I am trying to create what is effectively a wiki-editing feature using embedded islands of XML for the more complicated markup - and these need parsing and validating when the user saves. – Hunger 21/2, 2017 at 17:9

Don't write your own, rather hack something like github.com/FasterXML/aalto-xml/blob/master/src/main/java/com/… – Overglaze 21/2, 2017 at 20:9

I wrote a quick SAX reader that gets the line numbers, throws an exception when it finds an unwanted attribute, and reports the text preceding the point where the error was thrown.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Stack;


import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.apache.log4j.Logger;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;



public class LocatorTestSAXReader {
    private static final Logger logger = Logger.getLogger(LocatorTestSAXReader.class);

    private static final String XML_FILE_PATH = "lib/xml/test-instance1.xml";

public Document readXMLFile(){

    Document doc = null;
    SAXParser parser = null;

    SAXParserFactory saxFactory = SAXParserFactory.newInstance();
    try {
        parser = saxFactory.newSAXParser();
        DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
        doc = docBuilder.newDocument();

    } catch (ParserConfigurationException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (SAXException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }


    StringBuilder text = new StringBuilder();
    DefaultHandler eleHandler = new DefaultHandler(){
        private Locator locator;

        @Override 
        public void characters(char[] ch, int start, int length){
            String thisText = new String(ch, start, length);
            // [a-zA-Z], not [a-zA-z]: the latter also matches '[', '\', ']', '^', '_' and '`'
            if(thisText.matches(".*[a-zA-Z]+.*")){
                text.append(thisText);
                logger.debug("element text: " + thisText);
            }

        }



        @Override
        public void setDocumentLocator(Locator locator){
            this.locator = locator;
        }

        @Override
        public void startElement(final String uri, final String localName, final String qName, 
                final Attributes attributes)
                    throws SAXException {
            int lineNum = locator.getLineNumber();
            logger.debug("I am now on line " + lineNum + " at element " + qName);

            int len = attributes.getLength();
            for(int i=0;i<len;i++){
                String attVal = attributes.getValue(i);
                String attName = attributes.getQName(i);

                logger.debug("att " + attName + "=" + attVal);

                if(attName.startsWith("bad")){
                    throw new SAXException("found attr : " + attName + "=" + attVal + " that starts with bad! at line : " + 
                locator.getLineNumber() + " at element " + qName +   "\nelement occurs below text : " + text);
                }
            }

        }




    };

    try {
        parser.parse(new FileInputStream(new File(XML_FILE_PATH)), eleHandler);
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (SAXException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        return doc;
    }


}

With regard to the text: depending on where in the XML file the error occurs, there may not be any text. So with this XML:

<?xml version="1.0"?>
<root>
  <section>
    <para>This is a quick doc to test the ability to get line numbers via the Locator object. </para>
  </section>    
  <section bad:attr="ok">
    <para>another para.</para>
  </section>
</root>

If the bad attr is in the first element, the text will be blank. In this case, the exception thrown was:

org.xml.sax.SAXException: found attr : bad:attr=ok that starts with bad! at line : 6 at element section
element occurs below text : This is a quick doc to test the ability to get line numbers via the Locator object.

When you say you tried using the Locator object, what exactly was the problem?

Speechless answered 1/2, 2017 at 1:39 Comment(2)

I want to know (using your example) the exact line and column position of the 'b' of "bad:attr". Or - if the value of the attribute is the problem - either the open quote or 'o' of "ok". – Hunger 15/2, 2017 at 22:30

But in other cases it might be the exact position of "<section>" if, for example, <section> was not a valid element inside <root>. Or the 'a' of "another para." if, say, "another para." was not a valid string to be found between <para></para>. In general, I want to know the exact line and column position of start/end tags, runs of text, attribute names and attribute values. – Hunger 15/2, 2017 at 22:33
