lxml.etree, element.text doesn't return the entire text from an element

8

19

I scraped some HTML via XPath and then converted it into an etree. Something similar to this:

<td> text1 <a> link </a> text2 </td>

but when I call element.text, I only get text1. (The rest must be there: when I check my query in Firebug, the text of the elements is highlighted, both the text before and after the embedded anchor elements...)

Cliffcliffes answered 22/1, 2011 at 19:56 Comment(3)
This is one way to do it (code snippet from my little Python scrape processor). I wonder if this is an lxml bug? - Cliffcliffes
Here's the code snippet: - Cliffcliffes
if element.tag == "td":
    children = element.getchildren()
    if len(children) > 0:
        topic = element.text + children[0].tail
    else:
        topic = element.text
    print("\tTopic:\t\t%s" % topic)
- Cliffcliffes
18

Use element.xpath("string()") or lxml.etree.tostring(element, method="text") - see the documentation.
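A minimal sketch of both calls, assuming the markup from the question and Python 3. The variable names are only for illustration. Note that tostring() returns bytes unless encoding="unicode" is passed, and that both calls include the text of the embedded anchor.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# XPath string(): the string value of the element, i.e. all descendant text.
via_xpath = td.xpath("string()")

# method="text" drops the markup; encoding="unicode" gives a str, not bytes.
via_tostring = etree.tostring(td, method="text", encoding="unicode")

print(repr(via_xpath))
print(repr(via_tostring))
```

By default tostring() also serializes the tail of the element you pass in; with_tail=False suppresses that.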

Marotta answered 23/1, 2011 at 1:56 Comment(3)
tostring(element, method="text") almost works, but it also returns the text of the embedded anchor element, which I don't want. - Cliffcliffes
element.text + child.tail works, but I wish element.text worked the way I want it to :) - Cliffcliffes
element.xpath("string()") returns the same result as tostring(). I tried xpath("text()"), which doesn't return the text of the anchor element, but it returns a list of 2 strings. Thanks for pointing out some stuff though. - Cliffcliffes
11

As a public service to people out there who may be as lazy as I am, here's some of the code from above in a form that you can run.

from lxml import etree

def get_text1(node):
    result = node.text or ""
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

def get_text2(node):
    return ((node.text or '') +
            ''.join(map(get_text2, node)) +
            (node.tail or ''))

def get_text3(node):
    # encoding="unicode" makes tostring() return str instead of bytes
    return (node.text or "") + "".join(
        [etree.tostring(child, encoding="unicode")
         for child in node.iterchildren()])


root = etree.fromstring(u"<td> text1 <a> link </a> text2 </td>")

print(root.xpath("text()"))
print(get_text1(root))
print(get_text2(root))
print(root.xpath("string()"))
print(etree.tostring(root, method="text", encoding="unicode"))
print(etree.tostring(root, method="xml", encoding="unicode"))
print(get_text3(root))

Output is:

snowy:rpg$ python test.py 
[' text1 ', ' text2 ']
 text1  text2 
 text1  link  text2 
 text1  link  text2 
 text1  link  text2 
<td> text1 <a> link </a> text2 </td>
 text1 <a> link </a> text2 
Tide answered 6/10, 2013 at 13:19 Comment(0)
7

It looks like an lxml bug to me, but it is by design according to the documentation. I've solved it like this:

def node_text(node):
    if node.text:
        result = node.text
    else:
        result = ''
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result
Booher answered 21/9, 2011 at 13:9 Comment(2)
It's not a bug, actually it's the feature that allows you to interpose text among subelements when building an XML element: https://mcmap.net/q/665686/-python-lxml-insert-text-at-given-position-relatively-to-subelements/694360 - Commeasure
Thanks for pointing that out. I guess that is useful, but imho it would be a lot clearer if .text would just return the full text and some other suitably named property would contain only the part up to the first subelement. How about node.head? This also gives a clue that what you'll want next is child.tail without having to stackoverflow first. - Booher
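To illustrate the point about interposing text among subelements, here is a small sketch that builds the markup from the question by hand. The variable names are illustrative only.

```python
from lxml import etree

td = etree.Element("td")
td.text = " text1 "            # text before the first child
a = etree.SubElement(td, "a")
a.text = " link "              # text inside <a>
a.tail = " text2 "             # text after </a>, still inside <td>

serialized = etree.tostring(td, encoding="unicode")
print(serialized)  # <td> text1 <a> link </a> text2 </td>
```

Each piece of text has exactly one owner, which is why .text alone stops at the first subelement.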
7

Another thing that seems to be working well to get the text out of an element is "".join(element.itertext())
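A quick sketch of what itertext() yields for the markup from the question (variable names are just for illustration). Like string(), it includes the text of the embedded anchor.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# itertext() walks the subtree in document order, yielding text and tails.
pieces = list(td.itertext())
joined = "".join(pieces)

print(pieces)        # [' text1 ', ' link ', ' text2 ']
print(repr(joined))  # ' text1  link  text2 '
```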

Vachil answered 6/4, 2014 at 8:4 Comment(0)
3
<td> text1 <a> link </a> text2 </td>

Here's how it is (ignoring whitespace):

td.text == 'text1'
a.text == 'link'
a.tail == 'text2'

If you don't want the text that is inside child elements, you can collect only their tails (.text and .tail are None when there is no text, hence the "or ''"):

text = (td.text or '') + ''.join(el.tail or '' for el in td)
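A runnable sketch of the same idea with the whitespace kept. It also shows why guarding against None can matter: .text and .tail are None when there is no text at that position. The second element, `empty`, is a made-up example.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")
a = td[0]
print(repr(td.text), repr(a.text), repr(a.tail))

# Text owned directly by <td>: its own text plus the tails of its children.
own_text = (td.text or "") + "".join(child.tail or "" for child in td)
print(repr(own_text))  # ' text1  text2 '

# Hypothetical element with no text around the child: both values are None.
empty = etree.fromstring("<td><a>link</a></td>")
empty_text = (empty.text or "") + "".join(c.tail or "" for c in empty)
```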
Branchiopod answered 8/12, 2013 at 0:49 Comment(0)
1
def get_text_recursive(node):
    return (node.text or '') + ''.join(map(get_text_recursive, node)) + (node.tail or '')
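One thing to be aware of with this function: it appends node.tail for the node you call it on as well, so text that follows the element leaks into the result. A small sketch, assuming a made-up `<tr>` wrapper around the markup from the question:

```python
from lxml import etree

def get_text_recursive(node):
    return (node.text or '') + ''.join(map(get_text_recursive, node)) + (node.tail or '')

# Hypothetical wrapper so that <td> has a tail of its own.
tr = etree.fromstring("<tr><td> text1 <a> link </a> text2 </td> after </tr>")
td = tr[0]

with_tail = get_text_recursive(td)      # includes ' after '
without_tail = "".join(td.itertext())   # stops at </td>

print(repr(with_tail))
print(repr(without_tail))
```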
Florentinoflorenza answered 26/1, 2012 at 3:26 Comment(0)
0

If the element is the <td>, you can do the following.

element.xpath('.//text()')

It will give you a list of all text nodes below the element itself (the dot is the context node). // means that descendants at any depth are searched, and text() is the node test that selects text nodes, so the text of the embedded anchor is included too.
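For comparison, a short sketch of text() versus .//text() on the markup from the question (variable names are illustrative):

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

direct = td.xpath("text()")        # text nodes that are children of <td>
all_nodes = td.xpath(".//text()")  # text nodes at any depth below <td>

print(direct)     # [' text1 ', ' text2 ']
print(all_nodes)  # [' text1 ', ' link ', ' text2 ']
```

So "".join(td.xpath("text()")) gives the text without the anchor, while .//text() behaves like itertext().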

Vachil answered 23/5, 2017 at 18:51 Comment(0)
0
element.xpath('normalize-space()') also works.
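As a rough explanation: with no argument, the XPath function normalize-space() takes the string value of the context node (all descendant text, anchor included), strips leading and trailing whitespace, and collapses internal runs of whitespace to single spaces. A sketch on the markup from the question, with a plain-Python variant that skips the anchor text:

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

normalized = td.xpath("normalize-space()")
print(repr(normalized))  # 'text1 link text2'

# Same whitespace cleanup, but only over the text owned directly by <td>.
own_normalized = " ".join("".join(td.xpath("text()")).split())
print(repr(own_normalized))  # 'text1 text2'
```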
Sansculotte answered 24/7, 2017 at 3:59 Comment(1)
Only pasting code is not enough. You should also explain why it works :) - Cribbing
