How to keep comments while parsing XML using Python / ElementTree

I am currently using Python 2.4.3 and am not allowed to upgrade.

I want to change the value of a given attribute in one or more tags, while keeping the XML comments in the updated file.

I have managed to create a Python script that takes an XML file as an argument and changes an attribute on each specified tag, as shown below:

def update(file, state):
    global Etree
    try:
        # Bind the module to the name Etree, which the code below uses.
        from elementtree import ElementTree as Etree
        print '*** using ElementTree'
    except ImportError, e:
        print '***'
        print '*** Error: Must install either ElementTree or lxml.'
        print '***'
        raise ImportError, 'must install either ElementTree or lxml'
    #end try

    doc = Etree.parse(file)
    root = doc.getroot()

    # Update the attribute on every matching element below the root.
    for element in root.findall('.//StateManageable'):
        element.attrib['initialState'] = state
    #end for
    doc.write(file)
#end def

This works: the "initialState" attributes are updated. However, my original XML also contains a lot of XML comments, and they are gone from the output, which is bad.

I suspect that parse only retrieves the XML structure, but I thought XML comments were part of the structure. I also realize that the "human-readable" formatting of my original document is lost, but I understand that is expected behavior; I need to reformat afterwards using xmllint --format or XSL.
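The suspicion about parse can be checked with a few lines. This is only a minimal sketch, and it assumes a modern Python 3 with the standard-library xml.etree.ElementTree rather than the standalone elementtree package on 2.4.3; the SAMPLE document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the original file.
SAMPLE = '<root><!-- a comment --><StateManageable initialState="STOP"/></root>'

# The default parser builds the tree without comment nodes.
root = ET.fromstring(SAMPLE)
child_tags = [child.tag for child in root]

# Serializing the tree again therefore cannot bring the comment back.
round_trip = ET.tostring(root, encoding='unicode')
```

With the default parser the comment never reaches the tree, so it is already lost before write is called.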

Jeroboam answered 17/12, 2010 at 21:7 Comment(1)
You bet. When I started writing my first scripts, I had a hard time realizing that all the good examples I found were for 2.7 :-) Jeroboam

I know this is old now, but I stumbled across the answer above about how to retain comments. Fredrik's published instructions on how to put comments into the tree still work with current versions of ElementTree, but they do more than I need, at least for my use. They wrap the XML in an extra element, which is undesirable for me. I also don't need processing instructions preserved, only comments. So I trimmed down the class he provided on the site to this:

import xml.etree.ElementTree as ET

class PCParser(ET.XMLTreeBuilder):

   def __init__(self):
       ET.XMLTreeBuilder.__init__(self)
       # assumes ElementTree 1.2.X
       self._parser.CommentHandler = self.handle_comment

   def handle_comment(self, data):
       self._target.start(ET.Comment, {})
       self._target.data(data)
       self._target.end(ET.Comment)

To use this, create an instance of the class to act as the parser, then pass it as the parser argument to ElementTree.parse() like this:

parser = PCParser()
self.tree = ET.parse(self.templateOut, parser=parser)

I take no credit whatsoever for the code, or for the undocumented use of ElementTree, but it works for me: it preserves only comments, without affecting the original document structure. Note that because it relies on undocumented attributes, any future change to ElementTree (which seems unlikely at this point after all these years) may break it.
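For readers on a newer interpreter, the same result may be possible without touching private attributes. The following is a minimal sketch, assuming Python 3.8 or later, where TreeBuilder accepts an insert_comments flag; it does not apply to Python 2.4.3 or ElementTree 1.2.x. The SAMPLE document and the update helper are illustrative, modeled on the question rather than taken from it.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, modeled on the question.
SAMPLE = """<config>
  <!-- keep this comment -->
  <StateManageable initialState="STOP"/>
</config>"""

def update(source, state):
    """Set initialState on every StateManageable element, keeping comments."""
    # insert_comments=True (Python 3.8+) makes TreeBuilder add comment
    # nodes to the tree instead of dropping them.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    tree = ET.parse(source, parser=parser)
    for element in tree.getroot().findall('.//StateManageable'):
        element.set('initialState', state)
    return tree

# source may be a filename or a file-like object; StringIO keeps the demo self-contained.
tree = update(io.StringIO(SAMPLE), 'START')
result = ET.tostring(tree.getroot(), encoding='unicode')
```

Writing the tree back with tree.write(filename) should then keep the comments that sit inside the root element.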

Wilkes answered 6/12, 2014 at 15:46 Comment(3)
I'm using lxml for this and trying to get it to work. I'm importing from lxml import etree as et. I think I can replace self._parser with et but can't figure out what to use instead of self._target. Can you help? Aerostatics
This doesn't work for Python 3 (tested with v3.5.4) as the API has changed. See here for a Python 3 solution. Maid
@Jon, the link is not working. Cause
