How can I parse and analyze a DTD file in Java?

I would like to write a Java program that takes a DTD file as input and outputs an XML instance file based on that DTD.

That means I have to parse and analyze the DTD file in Java. Is there an API available that provides methods for analyzing the structure and the elements of a DTD file?

thanks

Hulda answered 15/10, 2014 at 20:26 Comment(3)
Here is a useful link. #8700120Contessacontest
Well, thanks for the link, but in my case no XML file is given at the beginning. I have only a DTD file and I have to produce an XML file based on it. In the post you linked, the user already has both a DTD and an XML file and wants to validate the given XML file against the given DTD, so I'm afraid that post does not help me.Hulda
See also: #17606Booklover

A dirty solution to parsing a DTD would be to abuse Xerces internals. You could use it as a starting point to something acceptable as it is already available in recent JREs, source code is available (with JDK or from Apache), and can be modified to your liking (Apache license). Note that for real world DTDs with external entities etc. you would have to configure the XMLDTDLoader with adapters (e.g. setEntityResolver/Feature/Property).

Here is some standalone code to try it out (which seems to work on OpenJDK 1.7.0 and Oracle JDK 1.8.0 for me):

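import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.xml.sax.InputSource;
import com.sun.org.apache.xerces.internal.impl.dtd.DTDGrammar;
import com.sun.org.apache.xerces.internal.impl.dtd.XMLDTDLoader;
import com.sun.org.apache.xerces.internal.util.SAXInputSource;
import com.sun.org.apache.xerces.internal.xni.parser.XMLInputSource;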

public class So26391485 {
    public static void main(String[] args) throws Exception {
        // minimal example DTD
        StringWriter sw = new StringWriter();
        sw.write("<!DOCTYPE html [");
        sw.write("  <!ELEMENT html (head, body)>");
        sw.write("  <!ELEMENT head (title)> <!ELEMENT title (#PCDATA)>");
        sw.write("  <!ELEMENT body (p+)> <!ELEMENT p (#PCDATA)>");
        sw.write("]>");

        // read DTD
        InputStream dtdStream = new ByteArrayInputStream(sw.toString().getBytes());
        //InputStream dtdStream = So26391485.class.getResourceAsStream("your.dtd");
        Scanner scanner = new Scanner(dtdStream);
        String dtdText = scanner.useDelimiter("\\z").next();
        scanner.close();

        // DIRTY: use Xerces internals to parse the DTD
        Pattern dtdPattern = Pattern.compile("^\\s*<!DOCTYPE\\s+(\\S+)\\s*\\[(.*)\\]>\\s*$", Pattern.DOTALL);
        Matcher m = dtdPattern.matcher(dtdText);
        if (m.matches()) {
            String docType = m.group(1);
            InputSource is = new InputSource(new StringReader(m.group(2)));
            XMLInputSource source = new SAXInputSource(is);
            XMLDTDLoader d = new XMLDTDLoader();
            DTDGrammar g = (DTDGrammar) d.loadGrammar(source);
            g.printElements();
        }
    }
}

(I had to chop off the DOCTYPE declaration because I did not manage to have Xerces read the DTD as is. After all XMLDTDLoader was not meant to be used like that ...)
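If relying on Xerces internals is a concern, a different technique (not the one used above) is to let an ordinary SAX parser read the DOCTYPE and report the declarations through the standard org.xml.sax.ext.DeclHandler extension. The following is only a minimal sketch under some assumptions: the class name DtdDeclSketch and the helper elementModels are made up for illustration, the DTD is available as a DOCTYPE with an internal subset, and a stub root element is appended so that the parser has a document to read. The exact text of the reported content models depends on the parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DeclHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class DtdDeclSketch {

    // Returns element name -> content model, in declaration order.
    // Assumption: doctype is a complete "<!DOCTYPE root [ ... ]>" string.
    static Map<String, String> elementModels(String doctype, String root) throws Exception {
        final Map<String, String> models = new LinkedHashMap<>();
        DeclHandler decl = new DeclHandler() {
            @Override public void elementDecl(String name, String model) {
                models.put(name, model);
            }
            // The remaining callbacks are ignored in this sketch.
            @Override public void attributeDecl(String e, String a, String type, String mode, String value) { }
            @Override public void internalEntityDecl(String name, String value) { }
            @Override public void externalEntityDecl(String name, String publicId, String systemId) { }
        };
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Standard SAX2 extension property for DTD declaration events.
        reader.setProperty("http://xml.org/sax/properties/declaration-handler", decl);
        reader.setContentHandler(new DefaultHandler());
        // Append an empty stub root element; the parser is non-validating,
        // so the stub does not have to conform to the DTD.
        reader.parse(new InputSource(new StringReader(doctype + "<" + root + "/>")));
        return models;
    }

    public static void main(String[] args) throws Exception {
        String doctype = "<!DOCTYPE html ["
                + "<!ELEMENT html (head, body)>"
                + "<!ELEMENT head (title)> <!ELEMENT title (#PCDATA)>"
                + "<!ELEMENT body (p+)> <!ELEMENT p (#PCDATA)>"
                + "]>";
        for (Map.Entry<String, String> e : elementModels(doctype, "html").entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

This only hands you the content model as a string, so you would still have to parse that string yourself to analyze the structure.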

Manado answered 16/10, 2014 at 22:25 Comment(0)

Another option is com.sun.xml.dtdparser.DTDParser which is used in the JAXB schema compiler. It has a nice-looking com.sun.xml.dtdparser.DTDParser.parse(InputSource) method. I could not find any examples for that, but the usage is probably:

// Receives the DTD events
DTDEventListener listener = ...;
// Instantiate the parser
DTDParser parser = new DTDParser();
// Set the handler
parser.setDtdHandler(listener);
// Parse your DTD source
parser.parse(...);

However, I'd first try Xerces (see the other answer), as this DTD parser seems to be quite old. I think it was even me who mavenized it ages ago.

Generally, the task of generating a sample XML file based on a DTD or XML Schema is not easy. As far as I remember, this was a PhD-level research topic around 2000. I could not find a link, but there was a very nice research paper from IBM, if I am not mistaken.

Nowadays, I'd not take DTD but rather XML Schema as basis.

Booklover answered 16/10, 2014 at 22:45 Comment(2)
This project is now at github.com/eclipse-ee4j/jaxb-dtd-parserNeritic
And here is the maven artifact: mvnrepository.com/artifact/com.sun.xml.dtd-parser/dtd-parserFazeli

There is no standard API or data model for reading/manipulating/writing DTDs or XML Schemas, unfortunately. Your best bet is to look for a parser which offers a custom API for the purpose, or to just manipulate a Schema as an XML document and build your own data model for it.
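As a rough illustration of the "manipulate a Schema as an XML document" route, an XML Schema can be loaded with the ordinary DOM API and walked like any other XML file. This is only a sketch: the class name SchemaAsXmlSketch and the helper declaredElementNames are invented for the example, and it collects nothing more than the name attributes of xs:element declarations, which is far from a complete data model.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class SchemaAsXmlSketch {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    // Lists the names of all xs:element declarations in document order.
    static List<String> declaredElementNames(String xsdText) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the namespace-based lookup below
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xsdText)));
        NodeList nodes = doc.getElementsByTagNameNS(XSD_NS, "element");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            if (e.hasAttribute("name")) { // skip ref="..." style declarations
                names.add(e.getAttribute("name"));
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xsd = "<xs:schema xmlns:xs='" + XSD_NS + "'>"
                + "<xs:element name='note'><xs:complexType><xs:sequence>"
                + "<xs:element name='to' type='xs:string'/>"
                + "<xs:element name='body' type='xs:string'/>"
                + "</xs:sequence></xs:complexType></xs:element>"
                + "</xs:schema>";
        System.out.println(declaredElementNames(xsd)); // prints [note, to, body]
    }
}
```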

Generating "an XML instance file based on the DTD" is generally a very poorly-defined problem. There are entirely too many possible documents for any given DTD, and that's without considering the fact that you probably want the data content to be semantically meaningful too. You can do a bit better with XML Schemas, but even then producing a Valid document is only the tip of the iceberg of producing a correct document. It's possible to write editing tools that will help a user produce a well-formed document, but even that can be messy since the easiest editing path between two Valid documents may be through invalid documents. Tools have been written which do this, but they're not widely used because in most cases, when you want that much assistance, you want to go whole hog and write an editor which is completely aware of the document semantics, including things the DTD or Schema can't express.
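To make the "too many possible documents" point concrete, here is a deliberately naive sketch that picks just one of them. Everything in it is an assumption for illustration: the class MinimalInstanceSketch, the simplified model (each element maps to an ordered list of children that must each appear once, and an empty list stands for #PCDATA), and the placeholder text. It ignores choices, optional and repeated content beyond a single occurrence, attributes, and entities, and it simply gives up on deeply recursive models.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of any library.
class MinimalInstanceSketch {

    // Emits one instance for the given element, using each required child once.
    static String generate(Map<String, List<String>> required, String element, int depthLeft) {
        List<String> children = required.get(element);
        if (children == null || children.isEmpty()) {
            // Treated as #PCDATA: emit placeholder text.
            return "<" + element + ">text</" + element + ">";
        }
        if (depthLeft == 0) {
            throw new IllegalStateException("model too deep or recursive at: " + element);
        }
        StringBuilder sb = new StringBuilder("<" + element + ">");
        for (String child : children) {
            sb.append(generate(required, child, depthLeft - 1));
        }
        return sb.append("</").append(element).append(">").toString();
    }

    public static void main(String[] args) {
        // Hand-written simplification of: html (head, body), head (title), body (p+)
        Map<String, List<String>> required = new LinkedHashMap<>();
        required.put("html", Arrays.asList("head", "body"));
        required.put("head", Arrays.asList("title"));
        required.put("title", Collections.<String>emptyList());
        required.put("body", Arrays.asList("p"));
        required.put("p", Collections.<String>emptyList());
        System.out.println(generate(required, "html", 10));
        // prints <html><head><title>text</title></head><body><p>text</p></body></html>
    }
}
```

The output is Valid against the example DTD, but as noted above that says nothing about whether it is a meaningful document.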

Precursor answered 16/10, 2014 at 1:16 Comment(4)
"There is no standard API or data model for reading/manipulating/writing DTDs or XML Schemas, unfortunately." Depends on what you call "standard". There's no javax.* API, but Xerces or XSOM are quite standard.Booklover
There are parser-specific APIs. There isn't anything parser-independent, never mind language-independent, as far as I know. (We were discussing adding this capability to the DOM, but it ran into the rocks of known incompatibilities between DTDs and Schemas.)Precursor
I'm not sure what you mean by "parser-specific". XSOM is specifically for processing XML Schemas and has a very nice API. There was also MSV (Multi-Schema Validator), I think, also written by Kohsuke, now already old as hell. JAXB XJC supports quite a few XML "schema" languages like XML Schema, DTD, RelaxNG, so they do have internal representation models for that. Of course you're right, there is no unified API, but I think there are worthy tools.Booklover
I'm just pointing out that there's no portable solution; you have to commit to a specific implementation of a specific tool. Which is undesirable, but Oh Well.Precursor
