Why does lxml.etree.SubElement() allow making elements which are not serialisable?
from lxml import etree

element1 = etree.Element('{j:a}a', nsmap={None: 'j:a'})
etree.SubElement(element1, 'b')

element2 = etree.Element('{j:a}a', nsmap={None: 'j:a'})
etree.SubElement(element2, '{j:a}b')

both elements serialise to the same

<a xmlns="j:a"><b/></a>

but the two elements do not behave the same

element1.find('b') -> returns the Element

element2.find('b') -> returns None

if you do it the other way around

etree.fromstring('<a xmlns="j:a"><b/></a>')

you get the representation from element2, so

element2.find('b') -> returns None

which seems consistent because there is no namespaceless <b/> in the tree, because <b/> inherits the default namespace from <a/>

so what's the purpose of the representation in element1? It seems to add a namespaceless subelement <b/> and behaves that way. But when serialised the element inherits from <a>.

Why does this exist if it does not serialise anyway?
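For reference, the observations above can be reproduced with a short script. This is only a minimal sketch that reuses the question's 'j:a' namespace; the variable names (NS, tags1, found1 and so on) are introduced here for illustration and are not part of lxml.

```python
from lxml import etree

NS = 'j:a'

# Child created with a bare tag name: the in-memory element has no namespace.
element1 = etree.Element('{%s}a' % NS, nsmap={None: NS})
etree.SubElement(element1, 'b')

# Child created with a fully qualified tag name: it is in the j:a namespace.
element2 = etree.Element('{%s}a' % NS, nsmap={None: NS})
etree.SubElement(element2, '{%s}b' % NS)

# The in-memory tag names differ even though the trees look alike.
tags1 = [child.tag for child in element1]
tags2 = [child.tag for child in element2]

# A bare 'b' in find() only matches an element that has no namespace.
found1 = element1.find('b')
found2 = element2.find('b')

print(tags1, tags2)
print(found1 is not None, found2 is not None)

# Compare the serialised forms, as described in the question.
print(etree.tostring(element1))
print(etree.tostring(element2))
```

Comparing the .tag values is the quickest way to see which representation a given tree actually holds.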

Souffle answered 8/11, 2021 at 19:14 Comment(6)
which lxml version are you using? - Trothplight
lxml version is 4.6.3 - Souffle
What exactly is it that you think lxml should not allow? What did you expect? - Caseate
@Caseate The expected behavior would be that all elements that don't specify a namespace of their own will inherit the default namespace that's in effect in their location. That's how lxml's document parsing works (in accordance with the spec), but it's not how lxml's document generation works. That's at least "surprising API behavior", and it generates document trees that cannot exist and don't behave like they should. In my book this qualifies as a bug. - Race
An idea/conjecture: The concept of namespace declaration scope (rpbourret.com/xml/NamespacesFAQ.htm#scope) can only really be applied to serialized XML documents. It does not apply to elements that have been created with an API and only exist as in-memory data structures. - Caseate
@Caseate That's certainly one position to have on the matter, but I think that's argued from an "API implementer's convenience" point of view. It's certainly easier to do it that way. But elements either are in a namespace, or they are not. There is no ambiguity to resolve. The right thing to do in the first case would be to produce <a xmlns="j:a"><b xmlns="" /></a>. That would be consistent, it would generate a legal XML tree, it would behave the same after parsing, and it would make it immediately obvious that the API wants explicit namespaces as in .SubElement(element2, '{j:a}b'). - Race

It all comes down to namespaces

XML tags can (but need not) have a namespace. So even if the root node defines a default namespace, child nodes are allowed to have no namespace, which is not equivalent to being in the default namespace.

This is the difference between your element1 and element2: element1's subelement has no namespace; element2's subelement is in the default namespace, since when you create it you specify the default namespace. If you try

element2.find("{j:a}b") -> returns the element b, or to be more accurate, the element {j:a}b.

So yes, namespace matters. And when you create the elements with lxml, you can define elements without namespace: just don't add it.
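To illustrate the lookup side, here is a small sketch of three ways to address the namespaced child of element2. The prefix 'p' and the variable names are arbitrary choices made for this example; only the namespace URI 'j:a' comes from the question.

```python
from lxml import etree

element2 = etree.Element('{j:a}a', nsmap={None: 'j:a'})
etree.SubElement(element2, '{j:a}b')

# 1. Fully qualified name in Clark notation.
by_clark = element2.find('{j:a}b')

# 2. A prefix mapped to the namespace URI just for this lookup.
by_prefix = element2.find('p:b', namespaces={'p': 'j:a'})

# 3. QName builds the Clark-notation string from URI and local name.
by_qname = element2.find(etree.QName('j:a', 'b').text)

# A bare name does not match, because it means "b in no namespace".
bare = element2.find('b')

print(by_clark.tag, by_prefix.tag, by_qname.tag, bare)
```

The prefix used in the namespaces mapping does not have to match anything declared in the document; only the URI is compared.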

But what about serialization?

Now I am not an lxml expert, so this is just my guess on the point. Thing is when you serialize the element, there is no way to discriminate between elements which are really without namespace and element in the default namespace, so they are represented in the same way.

Consequently, serializing an element and then parsing it again, cannot give the original result. If for example, using your element1 you do:

sel1 = etree.tostring(element1)
element1s = etree.fromstring(sel1)

It turns out that element1s is not equal to element1, because the subelement b is now the subelement {j:a}b. When the string is parsed, elements that do not declare a namespace of their own are placed in the default namespace.
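The round trip can be checked by comparing the child's tag before and after. This is a sketch under the same assumptions as the question (lxml 4.6.3 was reported there); the names before and after are made up for this example.

```python
from lxml import etree

element1 = etree.Element('{j:a}a', nsmap={None: 'j:a'})
etree.SubElement(element1, 'b')

# Serialise the hand-built tree and parse the result again.
sel1 = etree.tostring(element1)
element1s = etree.fromstring(sel1)

# Tag of the child in the original tree and in the reparsed tree.
before = element1[0].tag
after = element1s[0].tag

print(sel1)
print(before)
print(after)
```

If the two printed tags differ, the serialised form did not preserve the namespaceless child, which is the asymmetry discussed here.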

Conclusion

Now, I don't know if this is intended or is a bug. To the best of my knowledge, if an XML document declares a default namespace, all elements which do not explicitly have a different namespace should be considered to be in the default namespace. That is what happens when you parse an XML document with the fromstring function. You can have "no namespace" only if no default namespace is declared.
So in my opinion the b subelement of your element1 should "inherit" the namespace of the parent node, since the parent node defines a default namespace with nsmap={None: "j:a"}.
But you could also be told that, since you are building the document using lxml elements, it's your responsibility to put each element in the correct namespace, which means you have to add the default namespace explicitly.

Since elements without namespaces are allowed by XML under some circumstances, lxml does not complain when an element does not have a namespace.
I think that automatically adding the default namespace to subelements of elements which declare a default namespace would be a cool feature, but it's just not there.
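Until something like that exists, a small wrapper can approximate it. The function sub_element_in_default_ns below is a hypothetical helper written for this answer, not an lxml API; it relies on the nsmap property, which lists the namespace declarations in scope for an element, including those inherited from its parents.

```python
from lxml import etree


def sub_element_in_default_ns(parent, tag):
    """Hypothetical helper: create a child and qualify a bare tag name
    with the default namespace that is in scope for the parent."""
    default_ns = parent.nsmap.get(None)
    # Leave tags alone that are already in Clark notation ({uri}name).
    if default_ns is not None and not tag.startswith('{'):
        tag = '{%s}%s' % (default_ns, tag)
    return etree.SubElement(parent, tag)


root = etree.Element('{j:a}a', nsmap={None: 'j:a'})
child = sub_element_in_default_ns(root, 'b')
# The default namespace is inherited, so nested children are qualified too.
grandchild = sub_element_in_default_ns(child, 'c')

print(child.tag, grandchild.tag)
print(etree.tostring(root))
```

This only covers the simple case of a bare tag name. It does not offer a way to deliberately create a namespaceless child under a parent that has a default namespace.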

Trothplight answered 8/11, 2021 at 22:55 Comment(3)
"there is no way to discriminate between elements which are really without namespace and element in the default namespace, so they are represented in the same way." - No, that's absolutely not true. Elements in a default namespace are in a namespace. The elements <foo:element xmlns:foo="some_ns_uri" /> and <element xmlns="some_ns_uri" /> are semantically indistinguishable. But <element xmlns="some_ns_uri" /> and <element /> are completely different things. - Race
But overall, you're correct. This is clearly at least an oversight in lxml, I'd say it qualifies as a bug. There can be no elements in XML that both don't declare their own namespace, and also don't inherit their scope's default namespace. The asymmetry in lxml's behavior between parsing <a xmlns="foo"><b /></a> and manually building the same thing shows that something is not right. - Race
The piece of code that implements SubElement shows no sign of looking up and adding the parent's default namespace. It only seems to look at the given nsmap, and the given prefix. (But I get a feeling that their point of view is that all element names in the lxml API are supposed to be fully qualified. So when you give b, and really mean {j:a}b, then it's your own fault.) - Race
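The distinction drawn in these comments can be checked from the parsing side. The sketch below parses two documents: one where the child undeclares the default namespace with xmlns="", and one where it inherits it. The variable names are chosen for this example only.

```python
from lxml import etree

# The child explicitly undeclares the default namespace.
undeclared = etree.fromstring('<a xmlns="j:a"><b xmlns=""/></a>')

# The child inherits the default namespace from its parent.
inherited = etree.fromstring('<a xmlns="j:a"><b/></a>')

print(undeclared[0].tag)
print(inherited[0].tag)
```

So serialised XML can express a namespaceless child under a parent with a default namespace; it just needs the explicit xmlns="" on that child.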
