I want to read XHTML files using SAX or StAX, whichever works best. However, I don't want entities to be resolved or replaced; ideally they should remain exactly as they appear in the source. I also don't want to use DTDs.
Here's an executable example (Scala 2.8.x):
import javax.xml.stream._
import javax.xml.stream.events._
import java.io._
println("StAX Test - "+args(0)+"\n")
val factory = XMLInputFactory.newInstance
factory.setProperty(XMLInputFactory.SUPPORT_DTD, false)
factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false)
println("------")
val xer = factory.createXMLEventReader(new FileReader(args(0)))
val entities = new collection.mutable.ArrayBuffer[String]
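while (xer.hasNext) {
  val event = xer.nextEvent
  if (event.isCharacters) {
    // Character data, including whatever the parser substituted for resolved entities
    print(event.asCharacters.getData)
  } else if (event.getEventType == XMLStreamConstants.ENTITY_REFERENCE) {
    // Entity references the parser reported without replacing them
    entities += event.asInstanceOf[EntityReference].getName
  }
}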
println("------")
println("Entities: " + entities.mkString(", "))
Given the following XHTML file ...
<html>
<head>
<title>StAX Test</title>
</head>
<body>
<h1>Hallo StAX</h1>
<p id="html">
&lt;div class=&quot;header&quot;&gt;
</p>
<p id="stuff">
&Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;
</p>
Das war's!
</body>
</html>
... running scala stax-test.scala stax-test.xhtml
will result in:
StAX Test - stax-test.xhtml
------
StAX Test
Hallo StAX
<div class="header">
berdies sollte das hier auch als Copyright sichtbar sein: ?
Das war's!
------
Entities: Uuml
So all entities have been replaced, more or less successfully. What I expected, and what I want, is this:
StAX Test - stax-test.xhtml
------
StAX Test
Hallo StAX
&lt;div class=&quot;header&quot;&gt;
&Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;
Das war's!
------
Entities: // well, or no entities above and instead:
// Entities: lt, quot, quot, gt, Uuml, #169
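To make the second variant concrete: the entity list I have in mind is what a naive scan over the raw, unparsed text would produce. The following is only a rough sketch of that idea, not a StAX solution and not a real parser (it would also match references inside comments or CDATA sections). The names `EntityScan` and `entityNames` are made up for this illustration.

```scala
object EntityScan {
  // Matches named (&Uuml;), decimal (&#169;) and hex (&#xA9;) references.
  private val EntityRef = """&(#?[A-Za-z0-9]+);""".r

  // Returns the reference names in document order, without '&' and ';'.
  def entityNames(raw: String): List[String] =
    EntityRef.findAllMatchIn(raw).map(_.group(1)).toList
}

// Hypothetical sample: the two entity-bearing lines from the test file.
val sample =
  "&lt;div class=&quot;header&quot;&gt;\n" +
  "&Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;"

println("Entities: " + EntityScan.entityNames(sample).mkString(", "))
// prints: Entities: lt, quot, quot, gt, Uuml, #169
```

That is the list I would like to get out of the event stream itself, with the references still intact when the document is written back out.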
Is this even possible? I want to parse XHTML, do some modifications and then output it like that as XHTML again. So I really want the entities to remain in the result.
Also, I don't understand why Uuml is reported as an EntityReference event while the others aren't.