How to create <!DOCTYPE> with Python's cElementTree

I tried to apply the answer to this question, but can't make it work: How to create "virtual root" with Python's ElementTree?

Here's my code:

import xml.etree.cElementTree as ElementTree
from StringIO import StringIO
s = '<?xml version=\"1.0\" encoding=\"UTF-8\" ?><!DOCTYPE tmx SYSTEM \"tmx14a.dtd\" ><tmx version=\"1.4a\" />'
tree = ElementTree.parse(StringIO(s)).getroot()
header = ElementTree.SubElement(tree,'header',{'adminlang': 'EN',})
body = ElementTree.SubElement(tree,'body')
ElementTree.ElementTree(tree).write('myfile.tmx','UTF-8')

When I open the resulting 'myfile.tmx' file, it contains this:

<?xml version='1.0' encoding='UTF-8'?>
<tmx version="1.4a"><header adminlang="EN" /><body /></tmx>

The DOCTYPE declaration is gone. What am I missing? Or is there a better tool?

Diarmuid answered 15/1, 2012 at 7:38 Comment(0)
C
13

You could use lxml and its tostring function:

from lxml import etree

s = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4a"/>""" 

tree = etree.fromstring(s)
header = etree.SubElement(tree,'header',{'adminlang': 'EN'})
body = etree.SubElement(tree,'body')

print etree.tostring(tree, encoding="UTF-8",
                     xml_declaration=True,
                     pretty_print=True,
                     doctype='<!DOCTYPE tmx SYSTEM "tmx14a.dtd">')

=>

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE tmx SYSTEM "tmx14a.dtd">
<tmx version="1.4a">
  <header adminlang="EN"/>
  <body/>
</tmx>
Cirrostratus answered 15/1, 2012 at 9:5 Comment(2)
I get this error with Python 3.6: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration. - Ceroplastics
etree.fromstring(s.encode("UTF-8")) works for me with Python 3.6. - Cirrostratus
T
17

You could set the xml_declaration argument of the write function to False, so the output has no XML declaration, and then write whatever header you need manually. In fact, if you pass the encoding as 'utf-8' (lowercase), the XML declaration is not added either.

import xml.etree.cElementTree as ElementTree

tree = ElementTree.Element('tmx', {'version': '1.4a'})
ElementTree.SubElement(tree, 'header', {'adminlang': 'EN'})
ElementTree.SubElement(tree, 'body')

with open('myfile.tmx', 'wb') as f:
    f.write('<?xml version="1.0" encoding="UTF-8" ?><!DOCTYPE tmx SYSTEM "tmx14a.dtd">'.encode('utf8'))
    ElementTree.ElementTree(tree).write(f, 'utf-8')

Resulting file (newlines added manually for readability):

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE tmx SYSTEM "tmx14a.dtd">
<tmx version="1.4a">
    <header adminlang="EN" />
    <body />
</tmx>
Typhon answered 15/1, 2012 at 8:52 Comment(6)
Can you explain how you added the new line to the XML? - Benedix
@Learner: I added it manually for readability. If you want XML with newlines from ElementTree, search for how to pretty-print XML. - Typhon
This gives me the error TypeError: write() argument must be str, not bytes in Python 3.6.4 on macOS. I think it's because you first write a string and then binary data to the same open() handle. - Britanybritches
@ElliottB Thanks, I updated the code. It should work on both Python 2 and 3. - Typhon
This solution doesn't work unless you enter the ElementTree manually (as said), which is surely not what you want to do. I posted a simple and stupid solution to this problem below. - Tolkan
@Benedix You could simply insert a "\n" (without quotes) into the string between the XML declaration and the DOCTYPE. - Rist
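The comments above raise two practical points: mixing str and bytes on one file handle, and getting newlines after the header. A minimal Python 3 sketch of the same idea follows. It writes only bytes to a binary stream and passes xml_declaration=False explicitly instead of relying on how the encoding name is spelled. The helper name write_with_doctype is hypothetical, and io.BytesIO stands in for a file opened with 'wb'.

```python
import io
import xml.etree.ElementTree as ET


def write_with_doctype(root, doctype, stream):
    """Write declaration, DOCTYPE and tree to a binary stream (hypothetical helper)."""
    # The header goes out as bytes because the stream is in binary mode.
    stream.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
    stream.write(doctype.encode("utf-8") + b"\n")
    # xml_declaration=False stops ElementTree from emitting a second declaration,
    # however the encoding name is spelled.
    ET.ElementTree(root).write(stream, encoding="utf-8", xml_declaration=False)


root = ET.Element("tmx", {"version": "1.4a"})
ET.SubElement(root, "header", {"adminlang": "EN"})
ET.SubElement(root, "body")

buffer = io.BytesIO()  # stands in for open('myfile.tmx', 'wb')
write_with_doctype(root, '<!DOCTYPE tmx SYSTEM "tmx14a.dtd">', buffer)
result = buffer.getvalue().decode("utf-8")
print(result)
```

The element tree itself is still written on a single line; only the declaration and the DOCTYPE get their own lines here.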
T
2

I used a different solution to add the DOCTYPE. It is very simple and very stupid.

import xml.etree.ElementTree as ET

# root (the Element to serialize) and path_file (the output path) are defined elsewhere.
doc_type = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE dlg:window ' \
           'PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "dialog.dtd">'

with open(path_file, "w", encoding='UTF-8') as xf:
    # ET.tostring() returns bytes by default, so decode before joining with the header.
    body = ET.tostring(root).decode('utf-8')
    xf.write(f"{doc_type}{body}")
Tolkan answered 25/3, 2019 at 16:12 Comment(0)
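The string-concatenation approach above could be wrapped in a small function. The sketch below is one possible shape, not a definitive implementation: the name serialize_with_doctype and the system_id parameter are made up for illustration. It takes the DOCTYPE name from the root tag, since a validating parser expects the two to match. It assumes a root tag without a namespace, so a prefixed name such as dlg:window would need separate handling.

```python
import xml.etree.ElementTree as ET


def serialize_with_doctype(root, system_id):
    """Return the document as str with a SYSTEM DOCTYPE (hypothetical helper)."""
    declaration = '<?xml version="1.0" encoding="UTF-8"?>'
    # The DOCTYPE name is taken from the root tag so the two cannot drift apart.
    doctype = f'<!DOCTYPE {root.tag} SYSTEM "{system_id}">'
    # encoding="unicode" makes tostring() return str rather than bytes.
    body = ET.tostring(root, encoding="unicode")
    return "\n".join([declaration, doctype, body])


root = ET.Element("tmx", {"version": "1.4a"})
ET.SubElement(root, "body")
text = serialize_with_doctype(root, "tmx14a.dtd")

# Round-trip check: the standard-library parser should accept the result.
reparsed = ET.fromstring(text.encode("utf-8"))
print(reparsed.tag)
```

Parsing the result back is a cheap sanity check that the header and body were joined into well-formed XML. It does not validate against the DTD.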
E
0

I couldn't find a solution to this problem using vanilla ElementTree either, and the solution proposed by demalexx produced invalid XML that my application (DITA) rejected. What I propose is a workaround that uses the re module, and it works perfectly for me.

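import re

# I found no clean way to specify a <!DOCTYPE ...> declaration in ElementTree, so
# substitute the existing <?xml ... ?> declaration with a full <?xml ... ?> + <!DOCTYPE ...> header.
# source_xml (the serialized document as str) and filename are defined elsewhere.
new_header = '<?xml version="1.0" encoding="UTF-8" ?>\n' \
             '<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">\n'

# Raw string avoids invalid escape sequences; count=1 replaces only the first match.
target_xml = re.sub(r"<\?xml .+?\?>", new_header, source_xml, count=1)
# On Python 3, open in text mode with an explicit encoding and write str, not bytes.
with open(filename, 'w', encoding='utf-8') as catalog_file:
    catalog_file.write(target_xml)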
Ernie answered 11/5, 2017 at 18:12 Comment(2)
Could you elaborate on the "non-valid XML" problem? - Rist
@posfan12, I'd guess the main issue was that the DOCTYPE was not at the beginning of a line, which is easy to fix in demalexx's answer. - Canto
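For reference, here is a self-contained Python 3 sketch of the regex substitution described in this answer. The helper name replace_declaration and the sample topic element are assumptions made for illustration, and xml_declaration=True in ET.tostring needs Python 3.8 or later. The replacement is passed as a function so that any backslashes in the header are not interpreted by re.sub.

```python
import re
import xml.etree.ElementTree as ET

NEW_HEADER = (
    '<?xml version="1.0" encoding="UTF-8" ?>\n'
    '<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">\n'
)


def replace_declaration(source_xml, new_header):
    """Swap the leading <?xml ...?> declaration for new_header (hypothetical helper)."""
    # \A anchors at the start; \s* also consumes the newline after the declaration.
    return re.sub(r"\A<\?xml .+?\?>\s*", lambda _: new_header, source_xml, count=1)


root = ET.Element("topic", {"id": "example"})
# xml_declaration=True needs Python 3.8+; the result is bytes, hence decode().
source_xml = ET.tostring(root, encoding="utf-8", xml_declaration=True).decode("utf-8")
target_xml = replace_declaration(source_xml, NEW_HEADER)
print(target_xml)
```

Because the new header ends with a newline, the DOCTYPE and the root element each start on their own line, which addresses the guess in the comment above.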
