How can one replace an element with text in lxml?

It's easy to completely remove a given element from an XML document with lxml's implementation of the ElementTree API, but I can't see an easy way of consistently replacing an element with some text. For example, given the following input:

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

... you could easily remove every <r> element with:

from lxml import etree
f = etree.fromstring(data)
for r in f.xpath('//r'):
    r.getparent().remove(r)
print(etree.tostring(f, pretty_print=True).decode())

However, how would you go about replacing each element with text, to get the output:

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>

It seems to me that, because the ElementTree API deals with text via the .text and .tail attributes of each element rather than as nodes in the tree, you have to handle a lot of different cases depending on whether the element has sibling elements, whether the existing element had a .tail attribute, and so on. Have I missed some easy way of doing this?
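To make the .text/.tail split concrete, here is a minimal sketch using two of the sample <m> elements from the input above. The variable names are only for illustration.

```python
from lxml import etree

# Parse one of the sample <m> elements from the question.
m = etree.fromstring('<m>Text before <r/> and after</m>')
r = m[0]

# Text before the first child is stored on the parent's .text;
# text following an element is stored on that element's .tail.
parent_text = m.text   # 'Text before '
r_tail = r.tail        # ' and after'

# With no surrounding text, both attributes are None rather than ''.
bare = etree.fromstring('<m><r/></m>')
bare_text = bare.text      # None
bare_tail = bare[0].tail   # None
```

So any replacement has to decide whether the new text belongs on the parent's .text or on a preceding sibling's .tail, and it has to cope with None in either place.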

Kirkkirkcaldy answered 24/3, 2011 at 11:11 Comment(2)
If <r/> has children, do you want those removed too? Or merged into <r/>'s parent? - Bang
In this case I just want to remove the <r> node and all its children, and replace it with a text string. Hopefully that's easier :) - Kirkkirkcaldy

I think that unutbu's XSLT solution is probably the correct way to achieve your goal.

However, here's a somewhat hacky way to achieve it, by modifying the tails of <r/> tags and then using etree.strip_elements.

from lxml import etree

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

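f = etree.fromstring(data)
for r in f.xpath('//r'):
    # Prepend the replacement to whatever text follows <r/> (tail may be None)
    r.tail = 'DELETED' + (r.tail or '')

# Remove every <r> element but keep its (now modified) tail text
etree.strip_elements(f, 'r', with_tail=False)

print(etree.tostring(f, pretty_print=True).decode())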

Gives you:

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
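The trick depends on with_tail=False. As a rough illustration of what that flag changes, this sketch strips <r/> from a single sample element both ways (the snippet and variable names are made up for the comparison):

```python
from lxml import etree

snippet = '<m>Text before <r/> and after</m>'

# Default (with_tail=True): the tail text is removed along with <r/>.
dropped = etree.fromstring(snippet)
etree.strip_elements(dropped, 'r')
dropped_xml = etree.tostring(dropped, encoding='unicode')

# with_tail=False: <r/> is removed, but its tail text stays in the tree.
kept = etree.fromstring(snippet)
etree.strip_elements(kept, 'r', with_tail=False)
kept_xml = etree.tostring(kept, encoding='unicode')
```

Because the tail survives, writing 'DELETED' into it before stripping leaves the replacement text exactly where the element used to be.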
Bang answered 24/3, 2011 at 14:13 Comment(5)
Thanks, that's a nice solution - I didn't know about strip_elements or the with_tail option. - Kirkkirkcaldy
Wanted to stick with lxml for HTML processing, but will probably switch to BeautifulSoup; it's far more intuitive for basic HTML modification, and can use lxml as a parser... soup = BeautifulSoup(text, "lxml") / for r in soup.find_all('r'): r.replace_with('DELETED') - Fragrant
Thanks @benzkij for the tip! It is super weird that text is sometimes treated as the tail of other nodes in the ElementTree API and not just as a normal text node as intended by XML. - Mistrot
@Mistrot XML does not intend anything, and the DOM which you're thinking of is but one possible object model. Not being the DOM is ElementTree's entire point; if you want the DOM there are packages which implement it. - Picro
@Picro Thanks for clearing that up! I guess I was so used to DOM representations of XML from other languages/libraries that I thought it was the intended way to represent XML. (I still think that it would be more convenient to have text as nodes in a tree similar to elements, but good to know that it is not prescribed by XML itself.) - Mistrot
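For comparison, this is roughly what the same replacement looks like in a DOM-style model. The sketch assumes the standard library's xml.dom.minidom as the DOM implementation; it is not part of lxml, and the sample element is taken from the question's input.

```python
from xml.dom import minidom

doc = minidom.parseString('<m>Text before <r/> and after</m>')
m = doc.documentElement

# In the DOM, text is a child node like any other, so <r/> has text siblings.
child_types = [child.nodeName for child in m.childNodes]

# Swapping the element for a text node is a single call.
r = m.getElementsByTagName('r')[0]
m.replaceChild(doc.createTextNode('DELETED'), r)
m.normalize()  # merge the three adjacent text nodes into one

dom_result = m.toxml()
```

Here there is no .text/.tail bookkeeping, at the cost of a tree with many more nodes to walk.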

Using strip_elements has the disadvantage that you cannot make it keep some of the <r> elements while replacing others. It also requires the existence of an ElementTree instance (which may not be the case). And lastly, you cannot use it to replace XML comments or processing instructions. The following should do the job:

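for r in f.xpath('//r'):
    # r.tail is None when nothing follows the element
    text = 'DELETED' + (r.tail or '')
    parent = r.getparent()
    if parent is not None:
        previous = r.getprevious()
        if previous is not None:
            # Append to the tail of the preceding sibling
            previous.tail = (previous.tail or '') + text
        else:
            # No preceding sibling: append to the parent's leading text
            parent.text = (parent.text or '') + text
        parent.remove(r)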
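As a sketch of the selective case, the same logic can be wrapped in a helper and applied only to the nodes you pick. The helper name replace_with_text, the keep attribute and the sample document are all invented for this example.

```python
from lxml import etree

def replace_with_text(node, text):
    """Replace node (element, comment or PI) with text, keeping its tail."""
    parent = node.getparent()
    if parent is None:
        return  # a node without a parent cannot be replaced by bare text
    text += node.tail or ''
    previous = node.getprevious()
    if previous is not None:
        previous.tail = (previous.tail or '') + text
    else:
        parent.text = (parent.text or '') + text
    parent.remove(node)

root = etree.fromstring(
    '<m>A <r keep="yes"/> B <r/> C <!-- note --> D</m>')

# Replace only <r> elements without a keep attribute, plus all comments.
for node in root.xpath('//r[not(@keep)] | //comment()'):
    replace_with_text(node, 'DELETED')

result = etree.tostring(root, encoding='unicode')
```

The XPath expression decides what gets replaced, so keeping some <r> elements is just a matter of narrowing the selection.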
Coronary answered 9/5, 2012 at 16:50 Comment(1)
I think text = 'DELETED' + r.tail should be text = 'DELETED' + r.tail if r.tail else 'DELETED'. - Sher

Using ET.XSLT:

import io
import lxml.etree as ET

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

f=ET.fromstring(data)
xslt='''\
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">    

    <!-- Replace r nodes with DELETED
         http://www.w3schools.com/xsl/el_template.asp -->
    <xsl:template match="r">DELETED</xsl:template>

    <!-- How to copy XML without changes
         http://mrhaki.blogspot.com/2008/07/copy-xml-as-is-with-xslt.html -->    
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy-of select="."/>
    </xsl:template>
    </xsl:stylesheet>
'''

# BytesIO needs bytes, so encode the stylesheet string first
xslt_doc = ET.parse(io.BytesIO(xslt.encode('utf-8')))
transform = ET.XSLT(xslt_doc)
f = transform(f)

print(ET.tostring(f).decode())

yields

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
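If the replacement string should not be hard-coded in the stylesheet, one possible variation is to pass it in as a stylesheet parameter. This is only a sketch: the parameter name replacement and the value 'CENSORED' are made up, and the copy templates are collapsed into the usual single identity template.

```python
from lxml import etree

# Identity transform plus one template that swaps <r> for a string parameter.
stylesheet = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="replacement"/>
  <xsl:template match="r">
    <xsl:value-of select="$replacement"/>
  </xsl:template>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
''')
transform = etree.XSLT(stylesheet)

doc = etree.fromstring('<m>Text before <r/> and after</m>')
# strparam() quotes the Python string so it is passed as an XPath string.
result = transform(doc, replacement=etree.XSLT.strparam('CENSORED'))
result_xml = etree.tostring(result.getroot(), encoding='unicode')
```

The compiled transform can then be reused with a different replacement string on each call.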
Exarate answered 24/3, 2011 at 12:31 Comment(1)
+1 That's a nice but really non-obvious answer :) This question occurred to me because of my insufficient answer to another question, and I was hoping there was an easier way than this. Even with a short example like this, XSLT is verbose and difficult to understand compared to the code in my question for just removing the elements. - Kirkkirkcaldy