How to parse a DTD file in Ruby

I was trying to convert a DTD file to a YAML file, and I've tried loading it both in libXML and Nokogiri, but it seems that a DTD file is not a valid XML file. I'm fine with using any third-party gems as long as I can parse the DTD file.

My attempt at conversion:

wget "http://xml.evernote.com/pub/enml2.dtd"
irb
require 'nokogiri'
xml = Nokogiri::XML::Document.parse('enml2.dtd')
xml.to_yaml
=> "--- !ruby/object:Nokogiri::XML::Document\ndecorators: \nnode_cache: []\nerrors:\n- !ruby/exception:Nokogiri::XML::SyntaxError\n  message: |\n    Start tag expected, '<' not found\n  domain: 1\n  code: 4\n  level: 3\n  file: \n  line: 1\n  str1: \n  str2: \n  str3: \n  int1: 0\n  column: 1\n"

Online XML validators return the same "Start tag expected" error. I assume this is because XML documents normally start with <?xml, which DTD files seem to be missing. That led me to conclude that DTD files are not valid XML files. Still, it feels odd that XML's own definition syntax was not itself defined as valid XML. Why is that?

I'm parsing the DTD file to remove invalid attributes from an XML file, to know which attributes to keep and which to remove, so I need a way to parse the DTD file.

And ultimately, this is all just a step in trying to convert HTML to ENML (Evernote Markup Language). The steps involved in it include:

  • Converting HTML to valid XHTML
  • Converting the body to an en-note element
  • Removing invalid tags and attributes as per the dtd file
  • Validating the enml file against the dtd

I'm currently considering just copying the disallowed attributes and tags from "Understanding the Evernote Markup Language" and using that list to validate my XHTML, but I'd prefer to use the DTD as my source.

The Nokogiri DTD class is a Node class for holding an inline DTD node and validating against it. In my case, I have an external DTD file specified using the SYSTEM attribute, which Nokogiri does not seem to support. And even if it did work, all I would get is validation.

I did get validation to work properly using:

require 'libxml'  # from the libxml-ruby gem

dtd = LibXML::XML::Dtd.new(File.read(Rails.root.join('lib', 'assets', 'enml2.dtd')))
enml_document = LibXML::XML::Document.string(enml)
ret = enml_document.validate(dtd)

I haven't tried REXML. I will give that a go and report back.

I'm trying to convert an HTML document to an XML document that validates against the given DTD. Most HTML elements and attributes are not allowed in the ENML schema, so I have to strip them. I also need to know which attributes are allowed and which are not, so that I can parse the XML properly and remove or sanitize the offending elements and attributes.
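
To make that cleanup step concrete, here is a minimal sketch using REXML, which ships with standard Ruby installations. The `ALLOWED` hash is a hypothetical, hand-written stand-in for whatever the DTD actually permits, and `sanitize` is an illustrative helper name, not part of any library. Disallowed elements are dropped together with their children, which may be too aggressive for real content.

```ruby
require 'rexml/document'

# Hypothetical allowlist (element name => permitted attribute names).
# In practice this would be derived from enml2.dtd.
ALLOWED = {
  'en-note' => [],
  'a'       => %w[href title],
  'br'      => []
}.freeze

# Removes elements missing from the allowlist (with their subtrees) and
# strips attributes that the allowlist does not name for an element.
def sanitize(xml, allowed = ALLOWED)
  doc = REXML::Document.new(xml)
  # Collect the elements first, so removing nodes does not disturb the traversal.
  REXML::XPath.match(doc, '//*').each do |element|
    permitted = allowed[element.name]
    if permitted.nil?
      element.remove
    else
      element.attributes.keys.each do |name|
        element.attributes.delete(name) unless permitted.include?(name)
      end
    end
  end
  doc
end

cleaned = sanitize('<en-note><a href="x" onclick="y">link</a><script>bad()</script><br/></en-note>')
puts cleaned
```

A single pass like this avoids re-running the validator after every fix, but it is only as good as the allowlist it is given.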

For the cleanup, I'm using Loofah, but to use it I need a tag->attributes list (which attributes are available for each tag). Instead of making multiple validation passes over the doc, I validate once at the end of cleanup and simply loop through each XML tag, cleaning it up. But to know how to clean them, I need to know which tags and attributes are supported in the valid schema. Thus, I need to parse the DTD file.
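
As a rough illustration of the tag->attributes list described above, the following pure-Ruby sketch pulls element and attribute names out of DTD text with regular expressions. It is an assumption-laden simplification, not a real DTD parser: it expands only internal parameter entities (`<!ENTITY % name "...">`), ignores conditional sections and external entities, and will misread a default value that contains `>`. The helper names `expand_parameter_entities` and `allowed_attributes`, and the sample DTD, are made up for this example.

```ruby
# Pieces of one attribute definition inside an ATTLIST: name, type, default.
ATT_TYPE    = /\([^)]*\)|NOTATION\s*\([^)]*\)|[A-Z]+/
ATT_DEFAULT = /#REQUIRED|#IMPLIED|(?:#FIXED\s+)?(?:"[^"]*"|'[^']*')/
ATTDEF      = /([\w:.\-]+)\s+(?:#{ATT_TYPE})\s+(?:#{ATT_DEFAULT})/

# Replaces %name; references with the text of internal parameter entities.
def expand_parameter_entities(dtd_text)
  entities = {}
  dtd_text.scan(/<!ENTITY\s+%\s+([\w.\-]+)\s+(["'])(.*?)\2\s*>/m) do |name, _quote, value|
    entities[name] ||= value # the first declaration of an entity wins
  end
  expanded = dtd_text.dup
  10.times do # bounded, so a self-referencing entity cannot loop forever
    replaced = expanded.gsub(/%([\w.\-]+);/) { |ref| entities.fetch(ref[1..-2], ref) }
    break if replaced == expanded
    expanded = replaced
  end
  expanded
end

# Returns a hash of element name => array of declared attribute names.
def allowed_attributes(dtd_text)
  text = expand_parameter_entities(dtd_text.gsub(/<!--.*?-->/m, ''))
  allowed = Hash.new { |hash, key| hash[key] = [] }
  text.scan(/<!ELEMENT\s+(\S+)/).flatten.each { |name| allowed[name] }
  text.scan(/<!ATTLIST\s+(\S+)(.*?)>/m) do |element, body|
    allowed[element].concat(body.scan(ATTDEF).flatten)
  end
  allowed
end

SAMPLE_DTD = <<~'DTD'
  <!ENTITY % coreattrs "style CDATA #IMPLIED title CDATA #IMPLIED">
  <!ELEMENT en-note ANY>
  <!ELEMENT a ANY>
  <!ATTLIST a
    %coreattrs;
    href CDATA #IMPLIED
    shape (rect|circle) "rect"
  >
  <!ELEMENT br EMPTY>
DTD

result = allowed_attributes(SAMPLE_DTD)
p result
```

The resulting hash could be dumped with `to_yaml` or handed to a cleanup routine. For anything beyond simple DTDs, a real parser such as the one in libxml2 is the safer choice.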

From what I understand, XSLT is the right tool for the job, but I'm not comfortable enough with it to use it.

Bloodstain answered 12/7, 2014 at 16:31 Comment(8)
Can you share what you've tried with both libxml and nokogiri? For example, have you used Nokogiri::XML::DTD (nokogiri.org/Nokogiri/XML/DTD.html) or played w/ REXML's DTD parser (rubydoc.info/stdlib/rexml/REXML/DTD/Parser)? – Penthea
Your input XML file is malformed to begin with? I'm a little fuzzy on what precisely you're doing; it sounds like you're doing transformations, which don't need to match a DTD in the first place. Use XSLT or a saner equivalent and filter out what you don't want, add what you do, etc. – Fewness
I'm just trying to convert a DTD file to a Ruby hash. How would you go about that? (I'll look into XSLT meanwhile.) – Bloodstain
I've added some more details on what I'm trying to do. – Bloodstain
What do you mean by "perform custom cleanup"? Why isn't it enough to simply validate the XML file against the DTD? – Chipboard
I'm sorry, but this question is getting more confusing with every update. For what job do you think XSLT is the right tool? One job for which it is NOT the right tool is parsing DTDs. I have shown how to parse and validate an XML file using Ruby. I have tried to explain that you don't need to parse the DTD separately. But you keep insisting that you must do this. I'm running out of things to say. – Chipboard
Someone mentioned using XSLT to transform XML docs from one schema to another, so I thought that might be applicable here. Sorry about the confusion. I'll attach more code in a bit to help you understand what I'm trying to do. – Bloodstain
Here's some code to help you understand the problem: gist.github.com/captn3m0/97a672e5dbc69e7d2015 – Bloodstain

"However, it does feel weird to me that the XML definition syntax itself was not defined as valid XML. I'd love to know any reasons behind this." (from question)

DTDs are a holdover from SGML, the precursor of XML, so it is actually not very strange that DTDs are not XML files. Keeping DTDs and their particular syntax was a deliberate decision when XML was created.

More modern schema languages such as W3C XML Schema and RELAX NG do use XML syntax.


The reason I'm parsing the DTD file is that I want to remove invalid attributes from an XML file. To know which attributes to keep and which to remove, I need a way to parse the DTD file. (from question)

I am just looking for a way to parse DTD files, not just validate using them, because I want to perform custom cleanup and validation using the dtd. (from bounty text)

I don't really understand what you mean by "custom cleanup". I also don't see the point in trying to parse the DTD in the first place.

In order to find out if any elements or attributes in an XML file are invalid (if they break the rules in an associated DTD), you need to parse the XML file using a validating XML parser. The parser will then tell you if there are any errors that need to be fixed.

Nokogiri is based on libxml2 which provides a validating parser. It does support external DTDs that are specified using <!DOCTYPE foo SYSTEM "bar.dtd"> syntax (how to make this work is shown in a comment on the issue that you refer to: https://github.com/sparklemotion/nokogiri/issues/440#issuecomment-3031164).

Here is how the validation can be done:

require 'nokogiri'

xml = File.read("yourfile.xml")
options = Nokogiri::XML::ParseOptions::DTDLOAD   # Needed for the external DTD to be loaded
doc = Nokogiri::XML::Document.parse(xml, nil, nil, options)
puts doc.external_subset.validate(doc) 

If there is no output from this code, then the XML document is valid against the DTD.

Chipboard answered 8/8, 2014 at 15:40 Comment(10)
Basically, I need to make the validation pass by removing invalid elements and attributes. To know which elements to remove and which to keep, I need to parse the DTD file. – Bloodstain
No, you don't need to parse the DTD. You need to parse the XML file as I have shown. If the validation does not pass, the parser will tell you. – Chipboard
Does a single validation output tell me about all the issues with the document in a single pass? – Bloodstain
Well, yes, it depends. Be prepared to run the validation multiple times, fixing issues iteratively. – Chipboard
I converted the DTD by hand and with some help from Matra, but I'd like to see a pure-Ruby solution. – Bloodstain
It is not easy to understand what you really want. Matra is a tool for visualizing DTDs; it is not an XML validator. How exactly are you going to use the result output by Matra (or an equivalent Ruby program)? I have already asked what you mean by "perform custom cleanup", but you never clarified that. – Chipboard
I clarified the custom cleanup bit in the question. – Bloodstain
Did you see the gist I posted (gist.github.com/captn3m0/97a672e5dbc69e7d2015)? – Bloodstain
Yes, I saw it. I'm very sorry, but I'm not sure I can be bothered to put more effort into this issue. The question and the exchange of comments here remind me of the "XY problem": someone has problem X and thinks Y will solve it. Instead of asking for help with problem X, the person asks about Y, which leads to confusion. See meta.stackexchange.com/questions/66377/what-is-the-xy-problem. – Chipboard
Hmm. Thanks for the effort anyway. I'll award the bounty to you for the effort you've put in. – Bloodstain
