I was trying to convert a DTD file to a YAML file, and I've tried loading it both in libXML and Nokogiri, but it seems that a DTD file is not a valid XML file. I'm fine with using any third-party gems as long as I can parse the DTD file.
My attempt at conversion:
wget "http://xml.evernote.com/pub/enml2.dtd"
irb
require 'nokogiri'
xml = Nokogiri::XML::Document.parse('enml2.dtd')
xml.to_yaml
=> "--- !ruby/object:Nokogiri::XML::Document\ndecorators: \nnode_cache: []\nerrors:\n- !ruby/exception:Nokogiri::XML::SyntaxError\n message: |\n Start tag expected, '<' not found\n domain: 1\n code: 4\n level: 3\n file: \n line: 1\n str1: \n str2: \n str3: \n int1: 0\n column: 1\n"
Any online XML validator also returns the error "Start tag expected". I assume this is because all valid XML documents start with <?xml, which DTD files seem to be missing. This has led me to conclude that DTD files are not valid XML files. It does feel odd, though, that the XML definition syntax itself was not defined as valid XML. Why?
I'm parsing the DTD file to remove invalid attributes from an XML file, to know which attributes to keep and which to remove, so I need a way to parse the DTD file.
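To make the goal concrete, here is a minimal sketch of the filtering I have in mind once I have the data. The ALLOWED_ATTRIBUTES hash and the permitted_attributes helper are hypothetical names, and the values are made up for illustration; they are not taken from enml2.dtd.

```ruby
# Hypothetical allow-list (tag => permitted attribute names).
# In practice this is the data I want to derive from the DTD.
ALLOWED_ATTRIBUTES = {
  'a'   => %w[href title],
  'div' => %w[title]
}.freeze

# Returns a new hash holding only the attributes permitted for `tag`,
# or nil when the tag itself is not in the allow-list.
def permitted_attributes(tag, attributes, allowed = ALLOWED_ATTRIBUTES)
  return nil unless allowed.key?(tag)

  attributes.select { |name, _value| allowed[tag].include?(name) }
end

cleaned = permitted_attributes('a', { 'href' => '/x', 'onclick' => 'evil()' })
# `cleaned` keeps 'href' and drops 'onclick'
```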
Ultimately, this is all just one step in trying to convert HTML to ENML (Evernote Markup Language). The steps involved are:
- Converting HTML to valid XHTML
- Converting the body to an en-note element
- Removing invalid tags and attributes as per the DTD file
- Validating the ENML file against the DTD
I'm currently thinking of just copying the disallowed attributes and tags from "Understanding the Evernote Markup Language" and using them to validate my XHTML, but I'd prefer to use the DTD as my source.
The Nokogiri DTD class is a Node class for holding an inline DTD node and validating against it. In my case, I have an external DTD file specified using a SYSTEM identifier, which Nokogiri does not seem to support. And even if it did work, all I would get is validation.
I did get validation to work properly using:
dtd = XML::Dtd.new(File.read(Rails.root.join('lib', 'assets', 'enml2.dtd')))
enml_document = XML::Document.string(enml)
ret = enml_document.validate(dtd)
I haven't tried REXML. I will give that a go and report back.
I'm trying to convert an HTML document to an XML document that validates against the given DTD. Most HTML elements and attributes are not allowed in the ENML schema, so I have to strip them. I also need to know which attributes are allowed and which are not, so that I can parse the XML properly and remove or sanitize the offending elements and attributes.
For the cleanup I'm using Loofah, but to use it I need a tag->attributes map (which attributes are allowed for each tag). Rather than making multiple validation passes over the document, I validate once at the end of cleanup and otherwise just loop through each XML tag, cleaning it up. But to know how to clean them, I need to know which tags and attributes the schema supports. Thus, I need to parse the DTD file.
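For reference, this is roughly the output I am after, sketched with plain regexes. It is not a real DTD parser: it assumes no conditional sections, skips external entities and NOTATION attribute types, and assumes no ">" appears inside an attribute default. The names expand_parameter_entities and allowed_attributes are my own, and the sample DTD fragment is invented, not copied from enml2.dtd.

```ruby
# Matches one attribute definition inside an ATTLIST body:
#   name, then a type (keyword or enumeration), then a default declaration.
ATTR_DEF = /([\w:.-]+)\s+(?:\([^)]*\)|[A-Z]+)\s+(?:#REQUIRED|#IMPLIED|(?:#FIXED\s+)?(?:"[^"]*"|'[^']*'))/

# Replaces %name; references with the text of internal parameter entities.
def expand_parameter_entities(dtd_text)
  entities = {}
  dtd_text.scan(/<!ENTITY\s+%\s+([\w.-]+)\s+(["'])(.*?)\2\s*>/m) do |name, _quote, value|
    entities[name] ||= value # the first declaration of an entity wins
  end

  expanded = dtd_text
  # Entities may reference other entities, so substitute a bounded number of times.
  10.times do
    changed = false
    expanded = expanded.gsub(/%([\w.-]+);/) do
      name = Regexp.last_match(1)
      if entities.key?(name)
        changed = true
        entities[name]
      else
        Regexp.last_match(0) # unknown (e.g. external) entity: leave untouched
      end
    end
    break unless changed
  end
  expanded
end

# Builds a hash of element name => array of declared attribute names.
def allowed_attributes(dtd_text)
  text = expand_parameter_entities(dtd_text.gsub(/<!--.*?-->/m, ''))
  allowed = {}

  # Every declared element is allowed, even if it has no attributes.
  text.scan(/<!ELEMENT\s+([\w:.-]+)/).flatten.each { |tag| allowed[tag] ||= [] }

  text.scan(/<!ATTLIST\s+([\w:.-]+)\s+(.*?)>/m) do |tag, body|
    allowed[tag] ||= []
    allowed[tag] |= body.scan(ATTR_DEF).flatten
  end
  allowed
end

# Invented fragment in DTD syntax, for illustration only.
sample_dtd = <<~DTD
  <!ENTITY % coreattrs "style CDATA #IMPLIED
                        title CDATA #IMPLIED">
  <!ELEMENT en-note ANY>
  <!ATTLIST en-note %coreattrs; bgcolor CDATA #IMPLIED>
  <!ELEMENT br EMPTY>
DTD

tag_map = allowed_attributes(sample_dtd)
```

With this fragment, tag_map maps en-note to style, title and bgcolor, and br to an empty list. I would rather get this from an actual parser than maintain regexes like these.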
From what I understand, XSLT is the right tool for the job, but I'm not comfortable enough with it to use it.