How to track the source line (location) of an XML element?
A

4

10

I assume that there is probably no satisfactory answer to this question, but I ask it anyway in case I missed something.

Basically, I want to find out the line in the source document from which a certain XML element originated, given the element instance. I want this only for better diagnostic error messages - the XML is part of a configuration file, and if there is something wrong with it, I want to be able to point the reader of the error message to exactly the right place in the XML document so he can correct the error.

I understand that the standard Scala XML support probably has no built-in feature like this. After all, it would be wasteful to annotate every single NodeSeq instance with such information, and not every XML element even has a source document from which it has been parsed. It seems to me that the standard Scala XML parser throws the line information away, and later on there is no way to retrieve it.

But switching to another XML framework is not an option. Adding another library dependency "only" for the sake of better diagnostic error messages seems inappropriate to me. Also, despite some shortcomings, I really like the built-in pattern matching support for XML.

My only hope is that you can show me a way to alter or subclass the standard Scala XML parser such that the nodes it produces will be annotated with the number of the source line. Maybe a special subclass of NodeSeq can be created for this. Or maybe only Atom can be subclassed because NodeSeq is too dynamic? I don't know.

Anyway, my hopes are close to zero. I don't think there is a place in the parser where we can hook in to change the way nodes are created, and that at that place the line information is available. Still, I wonder why I have not found this question before. Please point me to the original if this is a duplicate.

Actual answered 15/12, 2010 at 2:14 Comment(0)
A
11

I had no idea how to do that, but Pangea showed me the way. First, let's create a trait to handle location:

import org.xml.sax.{helpers, Locator, SAXParseException}
trait WithLocation extends helpers.DefaultHandler {
    var locator: org.xml.sax.Locator = _
    // Explicit ": Unit =" replaces the deprecated procedure syntax
    def printLocation(msg: String): Unit = {
        println("%s at line %d, column %d" format (msg, locator.getLineNumber, locator.getColumnNumber))
    }

    // Get location
    abstract override def setDocumentLocator(locator: Locator): Unit = {
        this.locator = locator
        super.setDocumentLocator(locator)
    }

    // Display location messages
    abstract override def warning(e: SAXParseException): Unit = {
        printLocation("warning")
        super.warning(e)
    }
    abstract override def error(e: SAXParseException): Unit = {
        printLocation("error")
        super.error(e)
    }
    abstract override def fatalError(e: SAXParseException): Unit = {
        printLocation("fatal error")
        super.fatalError(e)
    }
}

Next, let's create our own loader overriding XMLLoader's adapter to include our trait:

import scala.xml.{factory, parsing, Elem}
object MyLoader extends factory.XMLLoader[Elem] {
    override def adapter = new parsing.NoBindingFactoryAdapter with WithLocation
}

And that's all there is to it! The object XML adds little to XMLLoader -- basically, the save methods. You might want to look at its source code if you feel the need for a full replacement. But this is only necessary if you want to handle all of this yourself, since Scala already has a trait to report errors:

object MyLoader extends factory.XMLLoader[Elem] {
    override def adapter = new parsing.NoBindingFactoryAdapter with parsing.ConsoleErrorHandler
}

The ConsoleErrorHandler trait extracts its line and column information from the exception, by the way. For our purposes, we need the location outside exceptions too (I'm assuming).

Now, to modify node creation itself, look at the abstract methods of scala.xml.parsing.FactoryAdapter. I have settled on createNode, but I'm overriding at the NoBindingFactoryAdapter level, because that returns Elem instead of Node, which enables me to add attributes. So:

import org.xml.sax.Locator
import scala.xml._
import parsing.NoBindingFactoryAdapter
trait WithLocation extends NoBindingFactoryAdapter {
    var locator: org.xml.sax.Locator = _

    // Get location
    abstract override def setDocumentLocator(locator: Locator): Unit = {
        this.locator = locator
        super.setDocumentLocator(locator)
    }

    abstract override def createNode(pre: String, label: String, attrs: MetaData, scope: NamespaceBinding, children: List[Node]): Elem = (
        super.createNode(pre, label, attrs, scope, children) 
        % Attribute("line", Text(locator.getLineNumber.toString), Null) 
        % Attribute("column", Text(locator.getColumnNumber.toString), Null)
    )
}

object MyLoader extends factory.XMLLoader[Elem] {
    // Keeping ConsoleErrorHandler for good measure
    override def adapter = new parsing.NoBindingFactoryAdapter with parsing.ConsoleErrorHandler with WithLocation
}

Result:

scala> MyLoader.loadString("<a><b/></a>")
res4: scala.xml.Elem = <a line="1" column="12"><b line="1" column="8"></b></a>

Note that each element got the last location, the one at its closing tag, because createNode runs when the element ends. That's one thing that can be improved by overriding startElement to push each element's start position onto a stack, and endElement to pop from this stack into a var used by createNode.
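A minimal sketch of that improvement, assuming the same scala.xml API used above (FactoryAdapter.endElement calling createNode). The names WithStartLocation and StartLoader are hypothetical, and the position reported is wherever the SAX parser's Locator points at startElement time, which is normally just after the start tag:

```scala
import org.xml.sax.{Attributes, Locator}
import scala.xml._
import scala.xml.parsing.NoBindingFactoryAdapter

// Hypothetical variant of WithLocation that remembers where each element started.
trait WithStartLocation extends NoBindingFactoryAdapter {
  private var startLocator: Locator = _
  // (line, column) of every element that is still open, innermost first
  private var openStarts: List[(Int, Int)] = Nil
  // start position of the element createNode is about to build
  private var current: (Int, Int) = (-1, -1)

  override def setDocumentLocator(locator: Locator): Unit = {
    startLocator = locator
    super.setDocumentLocator(locator)
  }

  override def startElement(uri: String, localName: String, qName: String, atts: Attributes): Unit = {
    openStarts = (startLocator.getLineNumber, startLocator.getColumnNumber) :: openStarts
    super.startElement(uri, localName, qName, atts)
  }

  override def endElement(uri: String, localName: String, qName: String): Unit = {
    // Pop before delegating: the parent's endElement is what calls createNode
    current = openStarts.head
    openStarts = openStarts.tail
    super.endElement(uri, localName, qName)
  }

  override def createNode(pre: String, label: String, attrs: MetaData,
                          scope: NamespaceBinding, children: List[Node]): Elem =
    super.createNode(pre, label, attrs, scope, children) %
      Attribute("line", Text(current._1.toString), Null) %
      Attribute("column", Text(current._2.toString), Null)
}

object StartLoader extends factory.XMLLoader[Elem] {
  override def adapter = new NoBindingFactoryAdapter with WithStartLocation
}
```

With a multi-line document, each element should then carry the line on which its start tag ends rather than the line of its closing tag. The exact column values depend on the underlying parser.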

Nice question. I learned a lot! :-)

Arsenal answered 15/12, 2010 at 12:7 Comment(2)
Sorry for answering so late. Your answer is brilliant. I didn't expect a real solution, but you actually found one. Thanks a lot! - Actual
Now if only you or someone can show how to get the start line number :P - Graniela
C
4

I see that Scala internally uses SAX for parsing. SAX lets the parser hand a Locator to the ContentHandler (via setDocumentLocator), which can then be used to retrieve the current location, for example where an error occurred. I am not sure how you can tap into the internal workings of Scala though. Here is one article I found that might be of some help to see if this is doable.
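To illustrate the Locator mechanism on its own, here is a rough sketch that uses only the JDK's SAX classes from Scala, without scala.xml. LineRecorder and LineRecorderDemo are made-up names for this example, and it assumes the parser supplies a Locator (the JDK's default parser does):

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource, Locator}
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

// Hypothetical handler: records (element name, line) for every start tag.
class LineRecorder extends DefaultHandler {
  private var locator: Locator = _
  val seen = ListBuffer.empty[(String, Int)]

  // The parser passes in its Locator before reporting any other event.
  override def setDocumentLocator(l: Locator): Unit = {
    locator = l
  }

  override def startElement(uri: String, localName: String, qName: String, atts: Attributes): Unit = {
    seen += ((qName, locator.getLineNumber))
  }
}

object LineRecorderDemo {
  // Parses the string and returns the recorded (name, line) pairs in document order.
  def linesOf(xml: String): List[(String, Int)] = {
    val handler = new LineRecorder
    val parser = SAXParserFactory.newInstance().newSAXParser()
    parser.parse(new InputSource(new StringReader(xml)), handler)
    handler.seen.toList
  }
}
```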

Civilization answered 15/12, 2010 at 4:52 Comment(2)
For what it is worth, the StAX XMLStreamReader has getLocation(), which likewise gives the location (input (filename), row, column). JDK 1.6 comes with a default implementation (Sun SJSXP), although there are better open source alternatives (Woodstox) available too. - Vocable
Agreed, but I am not sure if StAX is supported in Scala. - Civilization
E
2

I don't know anything about Scala, but the same issue pops up in other environments. For example, an XML transformation sends its results down a SAX pipeline to a validator, and when the validator tries to find line numbers for its validation errors, they're gone. Or the XML in question was never serialized or parsed, and therefore never had line numbers.

One way to address the problem is by generating (human-readable) XPath expressions to say where the error occurred. These are not as easy to use as line numbers but they're a lot better than nothing: they uniquely identify a node, and they're often pretty easy for humans to interpret (especially if they have an XML editor).

For example, this XSLT template by Ken Holman (I think) used by Schematron generates an XPath expression to describe the location/identity of the context node:

<xsl:template match="node() | @*" mode="schematron-get-full-path-2">
   <!--report the element hierarchy-->
   <xsl:for-each select="ancestor-or-self::*">
      <xsl:text>/</xsl:text>
      <xsl:value-of select="name(.)"/>
      <xsl:if test="preceding-sibling::*[name(.)=name(current())]">
         <xsl:text>[</xsl:text>
         <xsl:value-of
            select="count(preceding-sibling::*[name(.)=name(current())])+1"/>
         <xsl:text>]</xsl:text>
      </xsl:if>
   </xsl:for-each>
   <!--report the attribute-->
   <xsl:if test="not(self::*)">
      <xsl:text/>/@<xsl:value-of select="name(.)"/>
   </xsl:if>
</xsl:template>

I don't know if you can use XSLT in your scenario, but you could apply the same principle with whatever tools you have available.
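As a rough illustration of the same principle in Scala, the sketch below walks a scala.xml tree from the root and pairs every element with a positional path. XmlPaths and withPaths are hypothetical names, it uses unprefixed labels, and unlike the template above it indexes every sibling that shares a label (including the first), so treat it as an assumption-laden variant rather than a port:

```scala
import scala.collection.mutable
import scala.xml.Elem

object XmlPaths {
  // Pairs every element under `root` with a path such as /a/b[2].
  // scala.xml nodes have no parent pointer, so paths are built top-down.
  def withPaths(root: Elem): List[(String, Elem)] = {
    def walk(e: Elem, path: String): List[(String, Elem)] = {
      val kids = e.child.collect { case c: Elem => c }.toList
      val totals = kids.groupBy(_.label).map { case (label, es) => label -> es.size }
      val seen = mutable.Map.empty[String, Int].withDefaultValue(0)
      (path, e) :: kids.flatMap { c =>
        seen(c.label) += 1
        // Add an index only when the label is ambiguous among its siblings
        val step = if (totals(c.label) > 1) s"${c.label}[${seen(c.label)}]" else c.label
        walk(c, s"$path/$step")
      }
    }
    walk(root, "/" + root.label)
  }
}
```

An error message can then look up the offending element in the returned list (by reference) and print its path.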

Eusebiaeusebio answered 15/12, 2010 at 4:25 Comment(0)
V
2

Although you indicated that you would not want to use a different library or framework, it is worth noting that all good Java streaming parsers (Xerces for SAX, Woodstox and Aalto for StAX) do make location information available for all events/tokens they serve.

This information is not always retained by higher-level abstractions like DOM trees, because of the additional storage needed (performance isn't a big concern, since location information is always tracked anyway for error reporting). Still, this may be easy, or at least possible, to fix.
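For instance, a small sketch of reading locations from the JDK's built-in StAX implementation in Scala might look like this. StaxLocations and startLines are invented names, and the exact position reported for an event can differ between StAX implementations:

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object StaxLocations {
  // Returns (element name, line) for each start tag, in document order.
  def startLines(xml: String): List[(String, Int)] = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
    val out = List.newBuilder[(String, Int)]
    try {
      while (reader.hasNext) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT) {
          // The Location is only valid until the next call to next()
          out += ((reader.getLocalName, reader.getLocation.getLineNumber))
        }
      }
    } finally reader.close()
    out.result()
  }
}
```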

Vocable answered 15/12, 2010 at 6:53 Comment(0)
