Lxml element equality with namespaces

I am attempting to use lxml to parse the contents of a .docx document. I understand that lxml replaces namespace prefixes with the actual namespace; however, this makes it a real pain to check what kind of element tag I am working with. I would like to be able to do something like

if (someElement.tag == "w:p"):

but since lxml insists on prepending the full namespace I'd either have to do something like

if (someElement.tag == "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}p"):

or perform a lookup of the full namespace name from the element's nsmap attribute like this

targetTag = "{%s}p" % someElement.nsmap['w']
if (someElement.tag == targetTag):

Is there an easier way to convince lxml to either

  1. give me the tag string without the namespace prepended to it (I can use the prefix attribute along with this information to check which tag I'm working with), OR
  2. just give me the tag string using the prefix?

This would save a lot of keystrokes when writing this parser. Is this possible? Am I missing something in the documentation?

Shelby answered 30/3, 2011 at 23:40 Comment(1)
You don't ever want to match on the prefix, as the prefix is completely arbitrary. A valid .docx file could have any prefix, even 'xyz', as long as it was assigned to the same actual namespace string. lxml is doing you a favor by preventing you from relying on the namespace prefix for matching. - Johppah

Perhaps use local-name():

import lxml.etree as ET
tree = ET.fromstring('<root xmlns:f="foo"><f:test/></root>')
elt = tree[0]
print(elt.xpath('local-name()'))
# test
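
local-name() on its own ignores the namespace, so an element called test in some unrelated namespace would match as well. If that matters, XPath's namespace-uri() can be checked alongside it. A minimal sketch on the same toy document; the helper name is_tag is made up for illustration and is not part of lxml:

```python
import lxml.etree as ET


def is_tag(elt, namespace, local_name):
    """Hypothetical helper: True if elt has this namespace URI and local name."""
    return (elt.xpath('namespace-uri()') == namespace
            and elt.xpath('local-name()') == local_name)


tree = ET.fromstring('<root xmlns:f="foo"><f:test/></root>')
elt = tree[0]
print(is_tag(elt, 'foo', 'test'))  # same namespace, same local name
print(is_tag(elt, 'bar', 'test'))  # right local name, wrong namespace
```

This compares against the namespace URI itself, so it keeps working whatever prefix the document happens to use.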
Selective answered 31/3, 2011 at 1:26 Comment(0)

I could not find a way to obtain the non-namespaced tag name from an element -- lxml considers the full namespace part of the tag name. Here are a few options which may help.

You could use the QName class to construct a namespaced tag for comparisons:

import lxml.etree
from lxml.etree import QName

tree = lxml.etree.fromstring('<root xmlns:f="foo"><f:test/></root>')
qn = QName(tree.nsmap['f'], 'test')
assert tree[0].tag == qn

If you need the bare tag name you'll have to write a utility function to extract it:

def get_bare_tag(elem):
    return elem.tag.rsplit('}', 1)[-1]

assert get_bare_tag(tree[0]) == 'test'

Unfortunately, to my knowledge you can't search for tags with "any namespace" (e.g. {*}test) using lxml's xpath / find methods.
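
One workaround that should work with xpath() is to filter on local-name() in a predicate, which matches the element whatever namespace it lives in. A small sketch; the second namespace bar and its prefix g are made up for the example:

```python
import lxml.etree

doc = lxml.etree.fromstring(
    '<root xmlns:f="foo" xmlns:g="bar"><f:test/><g:test/><f:other/></root>'
)

# Select every descendant whose local name is "test", in any namespace.
matches = doc.xpath('//*[local-name() = "test"]')
print([m.tag for m in matches])
```

Both test elements are returned, each with its own fully qualified tag, while f:other is skipped.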

Updated: Note that lxml won't construct a tag with an unbalanced { or } -- it will raise ValueError: Invalid tag name, so it is safe to assume that an element whose tag name starts with { is balanced.

lxml.etree.Element('{foo')
ValueError: Invalid tag name
Lucilla answered 31/3, 2011 at 0:20 Comment(6)
A bit wasteful, especially the [1:]. To get the "bare tag", all you need is elem.tag.split('}')[-1]. Note that neither this code nor yours cares about unbalanced braces. - Dermott
Used rsplit to be more efficient, assuming the namespace urls tend to be long. - Lucilla
Good point. You didn't get rid of the pointless if statement; non-namespace tags tend to be short. - Dermott
My head is juggling 5 different things at the moment... Simplified it to one line. Thanks John. - Lucilla
[sigh] I was going to say "... unbalanced braces, which are impossible" but thought that I'd be regarded like Sybil Fawlty as "statin' the bleedin' obvious" :) - Dermott
+1 for suggesting the QName class. As for obtaining the local name (if you really have to) you can use xpath: elem.xpath('local-name()') - Holst

etree.QName should be able to get you what you want.

from lxml import etree

# [...]

tag = etree.QName(someElement)

print(tag.namespace, tag.localname)

For your example tag, this will output:

http://schemas.openxmlformats.org/wordprocessingml/2006/main p

Note that QName will take either the Element object or a string (such as from Element.tag).

And, as you note, you can also use Element.nsmap to map from an arbitrary prefix to a namespace.

So something like this:

if tag.namespace == someElement.nsmap["w"] and tag.localname == "p":
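
Putting the pieces together, here is a rough, self-contained sketch. It compares against the namespace URI directly rather than going through nsmap["w"], since that lookup would raise KeyError for a document that binds the namespace to a different prefix. The XML string is a tiny stand-in for word/document.xml, and W_NS and is_paragraph are names invented for this example:

```python
from lxml import etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A tiny stand-in for word/document.xml, not a complete document.
xml = (
    '<w:document xmlns:w="%s">'
    '<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>'
    '</w:document>' % W_NS
)
root = etree.fromstring(xml)


def is_paragraph(element):
    """Hypothetical helper: True if element is a WordprocessingML paragraph."""
    tag = etree.QName(element)
    return tag.namespace == W_NS and tag.localname == "p"


paragraphs = [el for el in root.iter() if is_paragraph(el)]
print(len(paragraphs))
```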
Placebo answered 14/10, 2016 at 4:13 Comment(0)

To save time when looking for high-volume tags like p (paragraph, I presume) in docx or c (cell) in xlsx, it's usual to set up the full tag once at the global or class level:

WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
tag_p = WPML_URI + 'p'
tag_t = WPML_URI + 't'

I have never seen an explanation of why one would want to use QName().

In the other direction, given a full tag, you can extract the base tag easily:

base_tag = full_tag.rsplit("}", 1)[-1]
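
As an illustration of how such constants might be used, the sketch below passes them to iter(), which only yields elements whose full tag matches. The XML template and the paragraph_texts helper are made up for this example; the same fragment is parsed once with the usual w prefix and once with an arbitrary xyz prefix to show that the prefix plays no part in the match:

```python
from lxml import etree

WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
tag_p = WPML_URI + 'p'
tag_t = WPML_URI + 't'

# Made-up fragment; {0} is replaced by whichever prefix we want to test.
template = (
    '<{0}:body xmlns:{0}="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<{0}:p><{0}:r><{0}:t>one</{0}:t></{0}:r></{0}:p>'
    '<{0}:p><{0}:r><{0}:t>two</{0}:t></{0}:r></{0}:p>'
    '</{0}:body>'
)


def paragraph_texts(root):
    """Hypothetical helper: text of every t element found inside a p element."""
    return [t.text for p in root.iter(tag_p) for t in p.iter(tag_t)]


for prefix in ('w', 'xyz'):
    root = etree.fromstring(template.format(prefix))
    print(prefix, paragraph_texts(root))
```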

Dermott answered 31/3, 2011 at 3:27 Comment(0)

I'm no Python expert, but I also had this problem (Windows 7 "Contacts" files). I wrote the following function for the lxml system.

This function takes an element and returns its tag with the namespace URI replaced by the matching prefix from the element's nsmap.

from lxml import etree

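def denstag(ee):
  tag = ee.tag
  for ns, uri in ee.nsmap.items():
    prefix = "{" + uri + "}"
    if tag.startswith(prefix):
      local = tag[len(prefix):]
      # ns is None for the default (unprefixed) namespace
      return local if ns is None else ns + ":" + local
  return tag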
Taynatayra answered 3/1, 2012 at 8:37 Comment(0)

Here is my solution for restoring the real (source) XML tag name.

Assume we have an xml_node variable, an instance of an lxml Element.

Before: {http://some/namespace/url}TagName (as read from xml_node.tag prop)

After: nsprefix:TagName (as result of xml_get_real_tag_name(xml_node))

from lxml import etree

def xml_get_real_tag_name(xml_node):
    """Replace lxml '{http://some/namespace/url}TagName' with regular 'nsprefix:TagName' string
    Args:
        xml_node (lxml.etree.Element) Source xml node entity
    Returns:
        str
    """
    if '{' in xml_node.tag and xml_node.prefix is not None:
        return ':'.join([xml_node.prefix, etree.QName(xml_node).localname])
    # No namespace, or a default namespace that has no prefix to restore
    return etree.QName(xml_node).localname
Sunnysunproof answered 18/5, 2020 at 18:29 Comment(0)
