How to use Python XML findall to find '<v:imagedata r:id="rId7" o:title="1-REN"/>'

I'm trying to do a find all from a Word document for <v:imagedata r:id="rId7" o:title="1-REN"/> with namespace xmlns:v="urn:schemas-microsoft-com:vml" and I cannot figure out what on earth the syntax is.

The docs only cover the very straightforward case, and with the URN and VML combo thrown in, I can't get any of the examples I've seen online to work. Does anyone happen to know what the syntax is?

I'm trying to do something like this:

namespace = {'v': "urn:schemas-microsoft-com:vml"}

results = ET.fromstring(xml).findall("imagedata", namespace)
for image_id in results:
    print(image_id)

Edit: What @aneroid wrote is 1000% the right answer and super helpful. You should upvote it. That said, after understanding all that - I went with the BS4 answer because it does the entire job in two lines exactly how I need it to 😂. If you don't actually care about the namespaces it seems waaaaaaay easier.

Caskey answered 31/5, 2020 at 1:5 Comment(0)
K
4

ET.findall() vs BS4.find_all():

  • ElementTree's findall() is not recursive by default*. It only finds direct children of the node it's called on. So in your case, it's only searching for imagedata nodes directly under the root element.
    • * As per mzjn's comment below, prefixing the match argument (tag or path) with ".//" will search for that node anywhere in the tree, since findall() supports a subset of XPath.
  • BeautifulSoup's find_all() searches all descendants. So it searches for 'imagedata' nodes anywhere in the tree.
  • However, Element.iter() does search all descendants. Using the 'working with namespaces' example in the docs:

    >>> for char in root.iter('{http://characters.example.com}character'):
    ...     print(' |-->', char.text)
    ...
     |--> Lancelot
     |--> Archie Leach
     |--> Sir Robin
     |--> Gunther
     |--> Commander Clement
    
  • Sadly, ET.iterfind(), which takes namespaces as a dict (like ET.findall), also searches only direct children by default*, just like ET.findall. Apart from how empty strings '' in the tags are treated wrt the namespace, and that one returns a list while the other returns an iterator, I can't say there's a meaningful difference between ET.findall and ET.iterfind.
    • * As above for ET.findall(), prefixing ".//" makes it search the entire tree (matching descendants at any depth).
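
To make the bullets above concrete, here is a minimal sketch comparing the three search styles. The sample document is made up for illustration (the `doc`, `body`, and `pict` element names and the `id` attributes are placeholders, not real Word markup):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one imagedata directly under the root, one nested deeper.
XML = """<doc xmlns:v="urn:schemas-microsoft-com:vml">
  <v:imagedata id="top"/>
  <body>
    <pict>
      <v:imagedata id="nested"/>
    </pict>
  </body>
</doc>"""

NS = {"v": "urn:schemas-microsoft-com:vml"}
root = ET.fromstring(XML)

# Plain tag: only direct children of root are matched.
direct = [e.get("id") for e in root.findall("v:imagedata", NS)]

# ".//" prefix: descendants at any depth are matched.
anywhere = [e.get("id") for e in root.findall(".//v:imagedata", NS)]

# iter() takes the fully qualified "{uri}tag" form rather than a prefix map.
via_iter = [e.get("id") for e in root.iter("{urn:schemas-microsoft-com:vml}imagedata")]

print(direct)    # ['top']
print(anywhere)  # ['top', 'nested']
print(via_iter)  # ['top', 'nested']
```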

When you use namespaces with ET, you still need the namespace prefix on the tag. The results line should be:

namespace = {'v': "urn:schemas-microsoft-com:vml"}
results = ET.fromstring(xml).findall("v:imagedata", namespace)  # note the 'v:'

Also, the 'v' doesn't need to be a 'v'; you could change it to something more meaningful if needed:

namespace = {'image': "urn:schemas-microsoft-com:vml"}
results = ET.fromstring(xml).findall("image:imagedata", namespace)
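
A small sketch of why the prefix is arbitrary: ET resolves the prefix through the dict you pass in, so only the URI has to match the document. The one-line sample XML and the variable names here are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document using the VML namespace.
XML = '<doc xmlns:v="urn:schemas-microsoft-com:vml"><v:imagedata id="a"/></doc>'
URI = "urn:schemas-microsoft-com:vml"
root = ET.fromstring(XML)

# Same URI, different local prefixes: both resolve to the same qualified tag.
by_v = root.findall("v:imagedata", {"v": URI})
by_image = root.findall("image:imagedata", {"image": URI})

# The fully qualified form needs no prefix map at all.
by_uri = root.findall("{%s}imagedata" % URI)

# All three calls return the same single element.
same = by_v == by_image == by_uri
print(len(by_v), same)  # 1 True
```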

Of course, this still won't necessarily get you all the imagedata elements if they aren't direct children of the root. One option is to write a recursive function to do it for you (see this answer on SO for how), though a recursive search can hit Python's recursion limit if the descendant depth is too...deep.

To get all the imagedata elements anywhere in the tree, use the ".//" prefix:

results = ET.fromstring(xml).findall(".//v:imagedata", namespace)
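
Putting it together for the element in the question, a sketch that also reads the namespaced r:id and o:title attributes. The XML fragment is a hypothetical, cut-down stand-in for a real word/document.xml, and the `R_ID` and `O_TITLE` names are introduced here for illustration. Attribute lookups via get() don't take a prefix map, so the "{uri}name" key is built by hand:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the shape of the question's markup; a real
# word/document.xml is much larger and declares more namespaces.
XML = """<w:document
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <w:body>
    <w:pict>
      <v:shape>
        <v:imagedata r:id="rId7" o:title="1-REN"/>
      </v:shape>
    </w:pict>
  </w:body>
</w:document>"""

NS = {
    "v": "urn:schemas-microsoft-com:vml",
    "o": "urn:schemas-microsoft-com:office:office",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}

root = ET.fromstring(XML)

# Attribute names are stored as "{namespace-uri}localname".
R_ID = "{%s}id" % NS["r"]
O_TITLE = "{%s}title" % NS["o"]

images = [(e.get(R_ID), e.get(O_TITLE)) for e in root.findall(".//v:imagedata", NS)]
print(images)  # [('rId7', '1-REN')]
```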
Karb answered 31/5, 2020 at 3:20 Comment(2)
findall can find all imagedata nodes. Just use findall(".//v:imagedata", namespace). – Garvin
Thanks! I've edited and clarified my answer wrt ET.findall(), as well as ET.iterfind(). – Karb
G
18

With ElementTree in Python 3.8 and later, you can simply use a wildcard ({*}) for the namespace:

results = ET.fromstring(xml).findall(".//{*}imagedata") 

Note the .// part, which means that the whole document (all descendants) is searched.
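
One thing to keep in mind, sketched below under the assumption of Python 3.8+: {*} matches the local name in any namespace, and also with no namespace at all, so it can return more than the VML elements if other vocabularies reuse the name. The sample XML and the `urn:example:other` namespace are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical document where three different "imagedata" elements coexist.
XML = """<root xmlns:v="urn:schemas-microsoft-com:vml" xmlns:other="urn:example:other">
  <group>
    <v:imagedata id="vml"/>
    <other:imagedata id="other"/>
    <imagedata id="plain"/>
  </group>
</root>"""

root = ET.fromstring(XML)

# Wildcard namespace (Python 3.8+): matches by local name only.
any_ns = [e.get("id") for e in root.findall(".//{*}imagedata")]

# Fully qualified tag: matches only the VML element.
vml_only = [e.get("id") for e in root.findall(".//{urn:schemas-microsoft-com:vml}imagedata")]

print(any_ns)    # ['vml', 'other', 'plain']
print(vml_only)  # ['vml']
```

For a Word document where only the VML namespace uses imagedata, the wildcard form is the shortest option; the qualified form is the safer choice when names may collide.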

Garvin answered 31/5, 2020 at 14:30 Comment(0)
C
1

I'm going to leave the question open, but the workaround I'm currently using is to use BeautifulSoup which happily accepts the v: syntax.

from bs4 import BeautifulSoup

soup = BeautifulSoup(xml, "lxml")  # the "lxml" parser requires the lxml package

results = soup.find_all("v:imagedata")
Caskey answered 31/5, 2020 at 1:40 Comment(0)
