Iterate over both text and elements in lxml etree

Suppose I have the following XML document:

<species>
    Mammals: <dog/> <cat/>
    Reptiles: <snake/> <turtle/>
    Birds: <seagull/> <owl/>
</species>

Then I get the species element like this:

import lxml.etree
doc = lxml.etree.fromstring(xml)
species = doc.xpath('/species')[0]

Now I would like to print a list of animals grouped by the label that precedes them (Mammals, Reptiles, Birds). How can I do this using the ElementTree API?

Mesquite answered 5/6, 2014 at 22:13 Comment(3)
if you look over to your right ... it looks like the 4th one down under related should point you in the right direction ...Bearden
do you have control of the xml format? Normally, classifiers such as Mammals, etc, are expressed as xml element names or attributes (e.g, <species flavor="Mammals">) so that xpath selectors are easily written.Expertise
No, I can't change the XML.Mesquite

If you enumerate all of the nodes, you'll see a text node with the class followed by element nodes with the species:

>>> for node in species.xpath("child::node()"):
...     print type(node), node
... 
<class 'lxml.etree._ElementStringResult'> 
    Mammals: 
<type 'lxml.etree._Element'> <Element dog at 0xe0b3c0>
<class 'lxml.etree._ElementStringResult'>  
<type 'lxml.etree._Element'> <Element cat at 0xe0b410>
<class 'lxml.etree._ElementStringResult'> 
    Reptiles: 
<type 'lxml.etree._Element'> <Element snake at 0xe0b460>
<class 'lxml.etree._ElementStringResult'>  
<type 'lxml.etree._Element'> <Element turtle at 0xe0b4b0>
<class 'lxml.etree._ElementStringResult'> 
    Birds: 
<type 'lxml.etree._Element'> <Element seagull at 0xe0b500>
<class 'lxml.etree._ElementStringResult'>  
<type 'lxml.etree._Element'> <Element owl at 0xe0b550>
<class 'lxml.etree._ElementStringResult'> 

So you can build it from there:

my_species = {}
current_class = None
for node in species.xpath("child::node()"):
    if isinstance(node, str):
        # Text node: lxml returns a "smart string" that subclasses str
        # (lxml.etree._ElementUnicodeResult on Python 3).
        text = node.strip(' \n\t:')
        if text:
            current_class = my_species.setdefault(text, [])
    elif isinstance(node, lxml.etree._Element):
        # Element node: file it under the most recent class label.
        if current_class is not None:
            current_class.append(node.tag)
print(my_species)

results in

{'Mammals': ['dog', 'cat'], 'Reptiles': ['snake', 'turtle'], 'Birds': ['seagull', 'owl']}

This approach is fragile: small changes in how the text nodes are arranged can break the parsing.

Expertise answered 5/6, 2014 at 23:18 Comment(5)
I like this one, simple XPath :)Mesquite
@alecxe - you process an ever increasing number of previous text nodes and discard all but the last one each time... I think my solution is simpler.Expertise
In Python 3, text node's type is lxml.etree._ElementUnicodeResult.Pain
You could also use hasattr(node, 'text')Rossie
Thanks, exactly what I needed! BTW if you are building on @Alicia's example, you could/should actually use the docs variable instead of species since it's not needed for what you are doing.Branching

Design note

The answer by @tdelaney is basically right, but I want to point out one nuance of the Python ElementTree API. Here's a quote from the lxml tutorial:

Elements can contain text:

<root>TEXT</root>

In many XML documents (data-centric documents), this is the only place where text can be found. It is encapsulated by a leaf tag at the very bottom of the tree hierarchy.

However, if XML is used for tagged text documents such as (X)HTML, text can also appear between different elements, right in the middle of the tree:

<html><body>Hello<br/>World</body></html>

Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree.

The two properties text and tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, that tend to get in the way fairly often (as you might know from classic DOM APIs).
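
To make the quoted passage concrete, here is a minimal sketch that parses the tutorial's own `<html><body>Hello<br/>World</body></html>` snippet and reads the two properties. The variable names `root`, `body` and `br` are mine, and the fallback to the standard library's `xml.etree.ElementTree` is an assumption that only matters if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # The standard library exposes the same text/tail properties.
    from xml.etree import ElementTree as etree

root = etree.fromstring("<html><body>Hello<br/>World</body></html>")
body = root[0]   # the <body> element
br = body[0]     # the <br/> element inside it

# Text before the first child belongs to the parent's .text ...
print(repr(body.text))
# ... while text after an element belongs to that element's .tail.
print(repr(br.tail))
# <br/> has no content of its own, so its .text is None.
print(repr(br.text))
```

So "Hello" is stored on `body.text` and "World" on `br.tail`; no separate text node objects are involved.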

Implementation

Taking these properties into account, it is possible to retrieve the document text without forcing the tree to output text nodes.

#!/usr/bin/env python3.3


import itertools
from pprint import pprint

try:
  from lxml import etree
except ImportError:
  from xml.etree import ElementTree as etree  # cElementTree was removed in Python 3.9
  
  
def textAndElement(node):
  '''In py33+ recursive generators are easy'''

  yield node

  text = node.text.strip() if node.text else None
  if text:
    yield text

  for child in node:
    yield from textAndElement(child)

  tail = node.tail.strip() if node.tail else None
  if tail:
    yield tail
    

if __name__ == '__main__':
  xml = '''
    <species>
      Mammals: <dog/> <cat/>
      Reptiles: <snake/> <turtle/>
      Birds: <seagull/> <owl/>
    </species>
  '''
  doc = etree.fromstring(xml)
  
  pprint(list(textAndElement(doc)))
  #[<Element species at 0x7f2c538727d0>,
  #'Mammals:',
  #<Element dog at 0x7f2c538728c0>,
  #<Element cat at 0x7f2c53872910>,
  #'Reptiles:',
  #<Element snake at 0x7f2c53872960>,
  #<Element turtle at 0x7f2c538729b0>,
  #'Birds:',
  #<Element seagull at 0x7f2c53872a00>,
  #<Element owl at 0x7f2c53872a50>]
  
  gen = textAndElement(doc)
  next(gen) # skip root
  groups = []
  for _, g in itertools.groupby(gen, type):
    groups.append(tuple(g))
  
  pprint(dict(zip(*[iter(groups)] * 2)) )
  #{('Birds:',): (<Element seagull at 0x7fc37f38aaa0>,
  #               <Element owl at 0x7fc37f38a820>),
  #('Mammals:',): (<Element dog at 0x7fc37f38a960>,
  #                <Element cat at 0x7fc37f38a9b0>),
  #('Reptiles:',): (<Element snake at 0x7fc37f38aa00>,
  #                <Element turtle at 0x7fc37f38aa50>)}
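
If you want plain tag names keyed by label, as the question asks, the grouped output above can be flattened a little further. This is only a sketch under the same layout assumption (each label is followed by its elements). `text_and_element` is the generator from above, renamed, and `by_class` is a name I made up for the result.

```python
import itertools

try:
    from lxml import etree
except ImportError:
    from xml.etree import ElementTree as etree


def text_and_element(node):
    """Yield the element, its stripped text, its descendants, then its stripped tail."""
    yield node
    text = node.text.strip() if node.text else None
    if text:
        yield text
    for child in node:
        yield from text_and_element(child)
    tail = node.tail.strip() if node.tail else None
    if tail:
        yield tail


xml = """<species>
  Mammals: <dog/> <cat/>
  Reptiles: <snake/> <turtle/>
  Birds: <seagull/> <owl/>
</species>"""

gen = text_and_element(etree.fromstring(xml))
next(gen)  # skip the root element

# Split the stream into alternating runs of strings and elements.
groups = [list(g) for _, g in
          itertools.groupby(gen, key=lambda item: isinstance(item, str))]

# Pair each run of labels with the run of elements that follows it.
by_class = {
    labels[-1].rstrip(':'): [el.tag for el in elements]
    for labels, elements in zip(groups[0::2], groups[1::2])
}
print(by_class)
```

Grouping on `isinstance(item, str)` rather than `type` keeps the split working even if the strings come back as a `str` subclass.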
Pain answered 22/6, 2015 at 17:47 Comment(0)
