Design note
The answer by @tdelaney is basically right, but I want to point to one nuance of Python element tree API. Here's a quote from the lxml
tutorial:
Elements can contain text:
<root>TEXT</root>
In many XML documents (data-centric documents), this is the only place where text can be found. It is encapsulated by a leaf tag at the very bottom of the tree hierarchy.
However, if XML is used for tagged text documents such as (X)HTML, text can also appear between different elements, right in the middle of the tree:
<html><body>Hello<br/>World</body></html>
Here, the <br/>
tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail
property. It contains the text that directly follows the element, up to the next element in the XML tree.
The two properties text
and tail
are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, that tend to get in the way fairly often (as you might know from classic DOM APIs).
Implementation
Taking these properties into account it is possible to retrieve document text without forcing the tree to output text nodes.
#!/usr/bin/env python3.3
import itertools
from pprint import pprint
try:
from lxml import etree
except ImportError:
from xml.etree import cElementTree as etree
def textAndElement(node):
'''In py33+ recursive generators are easy'''
yield node
text = node.text.strip() if node.text else None
if text:
yield text
for child in node:
yield from textAndElement(child)
tail = node.tail.strip() if node.tail else None
if tail:
yield tail
if __name__ == '__main__':
xml = '''
<species>
Mammals: <dog/> <cat/>
Reptiles: <snake/> <turtle/>
Birds: <seagull/> <owl/>
</species>
'''
doc = etree.fromstring(xml)
pprint(list(textAndElement(doc)))
#[<Element species at 0x7f2c538727d0>,
#'Mammals:',
#<Element dog at 0x7f2c538728c0>,
#<Element cat at 0x7f2c53872910>,
#'Reptiles:',
#<Element snake at 0x7f2c53872960>,
#<Element turtle at 0x7f2c538729b0>,
#'Birds:',
#<Element seagull at 0x7f2c53872a00>,
#<Element owl at 0x7f2c53872a50>]
gen = textAndElement(doc)
next(gen) # skip root
groups = []
for _, g in itertools.groupby(gen, type):
groups.append(tuple(g))
pprint(dict(zip(*[iter(groups)] * 2)) )
#{('Birds:',): (<Element seagull at 0x7fc37f38aaa0>,
# <Element owl at 0x7fc37f38a820>),
#('Mammals:',): (<Element dog at 0x7fc37f38a960>,
# <Element cat at 0x7fc37f38a9b0>),
#('Reptiles:',): (<Element snake at 0x7fc37f38aa00>,
# <Element turtle at 0x7fc37f38aa50>)}