How can I force a SAX parser (specifically, Xerces in Java) to use a DTD when parsing a document without having any doctype in the input document? Is this even possible?
Here are some more details of my scenario:
We have a bunch of XML documents that conform to the same DTD that are generated by multiple different systems (none of which I can change). Some of these systems add a doctype to their output documents, others do not. Some use named character entities, some do not. Some use named character entities without declaring a doctype. I know that's not kosher, but it's what I have to work with.
I'm working on a system that needs to parse these files in Java. Currently, it handles the above cases by first reading in the XML document as a stream, attempting to detect whether it has a doctype defined, and adding a doctype declaration if one isn't already present. The problem is that this code is buggy, and I'd like to replace it with something cleaner.
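To make the current approach concrete, here is a simplified sketch of that kind of pre-processing. It is not the actual production code: the class name `DoctypeInjector`, the root element `doc`, the system identifier `doc.dtd`, and the 1024-byte peek window are all placeholders chosen for illustration. It also assumes Java 11+ (for `readNBytes`) and an ASCII-compatible encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;

public class DoctypeInjector {
    // Placeholder values, assumed for illustration only.
    private static final int PEEK = 1024;
    private static final String DOCTYPE = "<!DOCTYPE doc SYSTEM \"doc.dtd\">";

    /**
     * Returns a stream with the same content as {@code in}, with a doctype
     * declaration inserted if none is found in the first PEEK bytes.
     */
    public static InputStream ensureDoctype(InputStream in) throws IOException {
        // Read only the head of the document so large files are not buffered.
        byte[] head = in.readNBytes(PEEK);

        // ISO-8859-1 maps bytes to chars one-to-one, so the round trip below
        // preserves any non-ASCII bytes in the head unchanged.
        String prefix = new String(head, StandardCharsets.ISO_8859_1);

        // Naive detection: this also matches "<!DOCTYPE" inside a comment,
        // and misses a doctype that starts after the peek window.
        if (!prefix.contains("<!DOCTYPE")) {
            int insertAt = 0;
            // The doctype has to come after the XML declaration, if any.
            // A leading byte order mark defeats this check.
            if (prefix.startsWith("<?xml")) {
                int end = prefix.indexOf("?>");
                if (end >= 0) {
                    insertAt = end + 2;
                }
            }
            prefix = prefix.substring(0, insertAt) + DOCTYPE + prefix.substring(insertAt);
        }

        byte[] newHead = prefix.getBytes(StandardCharsets.ISO_8859_1);
        // Stitch the (possibly modified) head back onto the rest of the stream.
        return new SequenceInputStream(new ByteArrayInputStream(newHead), in);
    }
}
```

The comments point out the sort of edge cases (byte order marks, comments, non-ASCII-compatible encodings) that make this style of stream patching fragile, which is why I'd rather have the parser itself supply the DTD.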
The files are large, so I can't use a DOM-based solution. I'm also trying to get character entities resolved, so an XML Schema doesn't help, since schemas cannot declare entities.
If you have a solution, could you please post it directly instead of linking to it? It doesn't do Stack Overflow much good if, in the future, the correct solution is behind a dead link.