Java - Read XML and leave all entities alone

Asked 12/9, 2011 at 9:42 Answered 12/9, 2011 at 12:26

I want to read XHTML files using SAX or StAX, whatever works best. But I don't want entities to be resolved, replaced or anything like that. Ideally they should just remain as they are. I don't want to use DTDs.

Here's an (executable, using Scala 2.8.x) example:

import javax.xml.stream._
import javax.xml.stream.events._
import java.io._

println("StAX Test - "+args(0)+"\n")
val factory = XMLInputFactory.newInstance
factory.setProperty(XMLInputFactory.SUPPORT_DTD, false)
factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false)

println("------")
val xer = factory.createXMLEventReader(new FileReader(args(0)))
val entities = new collection.mutable.ArrayBuffer[String]
while (xer.hasNext) {
    val event = xer.nextEvent
    if (event.isCharacters) {
        print(event.asCharacters.getData)
    } else if (event.getEventType == XMLStreamConstants.ENTITY_REFERENCE) {
        entities += event.asInstanceOf[EntityReference].getName
    }
}
println("------")
println("Entities: " + entities.mkString(", "))

Given the following xhtml file ...

<html>
    <head>
        <title>StAX Test</title>
    </head>
    <body>
        <h1>Hallo StAX</h1>
        <p id="html">
            &lt;div class=&quot;header&quot;&gt;
        </p>
        <p id="stuff">
            &Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;
        </p>
        Das war's!
    </body>
</html>

... running scala stax-test.scala stax-test.xhtml will result in:

StAX Test - stax-test.xhtml

------


    StAX Test


    Hallo StAX

      <div class="header">


      berdies sollte das hier auch als Copyright sichtbar sein: ?

    Das war's!

------
Entities: Uuml

So all entities have been replaced, more or less successfully. What I expected, and what I want, is this:

StAX Test - stax-test.xhtml

------


    StAX Test


    Hallo StAX

      &lt;div class=&quot;header&quot;&gt;


      &Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;

    Das war's!

------
Entities: // well, or no entities above and instead:
// Entities: lt, quot, quot, gt, Uuml, #169

Is this even possible? I want to parse XHTML, do some modifications and then output it like that as XHTML again. So I really want the entities to remain in the result.

Also I don't get why Uuml is reported as an EntityReference event while the rest aren't.

Phalangeal answered 12/9, 2011 at 9:42 Comment(0)

A bit of terminology: &#x169; is a numeric character reference (not an entity), and &auml; is an entity reference (not an entity).

I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.

As for entity references, low-level parse interfaces such as SAX will report the existence of an entity reference when it occurs in element content, but not when it occurs in attribute content. These are special events (startEntity and endEntity) notified only to the LexicalHandler rather than to the ContentHandler.
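A minimal Java sketch of that LexicalHandler mechanism follows. The class name EntityReportSketch and the sample document are made up for illustration. The sample declares Uuml in an internal DTD subset, which the question wanted to avoid, because a SAX parser would normally reject a reference to an undeclared entity. The expectation that predefined references such as &lt; are not reported is an assumption about the default configuration of the parser bundled with the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical class name, for illustration only.
class EntityReportSketch {

    // Collects the names of general entity references reported in element content.
    public static List<String> entityNames(String xml) throws Exception {
        final List<String> names = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                // Skip the external DTD subset ("[dtd]") and parameter entities ("%name").
                if (!name.startsWith("[") && !name.startsWith("%")) {
                    names.add(name);
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // The LexicalHandler is registered through a standard SAX property.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Sample input (an assumption): Uuml is declared in an internal DTD subset.
        String xml = "<!DOCTYPE p [<!ENTITY Uuml \"&#220;\">]>"
                   + "<p>&Uuml;berdies &lt;b&gt; &#169;</p>";
        System.out.println("Entities: " + entityNames(xml));
    }
}
```

The expanded text still arrives through characters() between the startEntity and endEntity calls, so an application that wants to keep the reference has to suppress or replace that text itself.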

Tightwad answered 12/9, 2011 at 12:26 Comment(0)

The answer to "why Uuml is reported as an EntityReference event while the rest aren't" is that the rest are defined by the XML spec, while &Uuml; is specific to HTML 4.0.

Since your goal is to write modified XHTML, it may be possible to force the serializer to emit numeric character references by setting the "encoding" to "US-ASCII" and/or the "method" to "html". The XSLT spec (which underlies Java XML serializers) says that the serializer "may output a character using a character entity reference" when the method is html. Setting the encoding to ASCII may force it to use numeric character references where named entity references aren't supported.
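The encoding suggestion can be tried with an identity transform, as in this rough Java sketch. The class name AsciiSerializerSketch and the sample input are made up for illustration, and the exact form of the emitted references (decimal or hexadecimal) depends on the serializer implementation, so no output is quoted here.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical class name, for illustration only.
class AsciiSerializerSketch {

    // Re-serializes the given XML with US-ASCII as the output encoding.
    public static String toAscii(String xml) throws Exception {
        // newTransformer() without a stylesheet gives an identity transform.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        // Characters outside US-ASCII should come back as character references.
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws Exception {
        // Sample input (an assumption): U+00DC and U+00A9 as literal characters.
        System.out.println(toAscii("<p>\u00DCberdies: \u00A9</p>"));
    }
}
```

This does not restore the references as they were written in the input. It only keeps the output free of non-ASCII characters, and named references such as &Uuml; would additionally depend on the "html" output method.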

Fusible answered 12/9, 2011 at 11:59 Comment(0)

-2

In Java I would use a regular expression.

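import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String... args) throws IOException {
  BufferedReader buf = new BufferedReader(new FileReader(args[0]));
  // Matches anything between '&' and the next ';'
  Pattern entity = Pattern.compile("&([^;]+);");
  // LinkedHashSet keeps first-seen order and drops duplicates
  Set<String> entities = new LinkedHashSet<String>();
  for (String line; (line = buf.readLine()) != null; ) {
    Matcher m = entity.matcher(line);
    while (m.find())
      entities.add(m.group(1));
  }
  buf.close();
  System.out.println("Entities: " + entities);
}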

prints

Entities: [lt, quot, gt, Uuml, #169]

Quarantine answered 12/9, 2011 at 9:52 Comment(2)

And like nearly everyone who attempts to parse XML using regular expressions, you would be wrong. For example, your regex will pick up entity-like things appearing in comments and CDATA sections; and if a comment contains an ampersand with no following semicolon it will cause havoc. Never use regular expressions to parse XML - you will always get it wrong. Downvoting. – Tightwad 12/9, 2011 at 12:31

@Michael Kay, That is a good explanation as to why it can be bad. I suspect you have come across more "wild" XML than I have. The XML I have seen is usually designed for a purpose. – Quarantine 12/9, 2011 at 13:24
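The two failure modes named in the first comment can be reproduced with the answer's own pattern. This is a small Java sketch. The class name RegexPitfallSketch and the two sample strings are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class RegexPitfallSketch {

    // Same pattern as in the answer above.
    static final Pattern ENTITY = Pattern.compile("&([^;]+);");

    // Returns every "entity name" the pattern finds in the text.
    public static List<String> scan(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Entity-like text inside a comment is reported as an entity.
        System.out.println(scan("<!-- Q&A; --><p>&lt;</p>"));   // prints [A, lt]
        // An ampersand with no semicolon in a comment swallows the following markup.
        System.out.println(scan("<!-- AT&T --><p>&lt;</p>"));   // prints [T --><p>&lt]
    }
}
```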
