Python + Expat: Error on &#0; entities
I have written a small function, which uses ElementTree and xpath to extract the text contents of certain elements in an xml file:

#!/usr/bin/env python2.5

import doctest
from xml.etree import ElementTree
from StringIO import StringIO

def parse_xml_etree(sin, xpath):
  """
Takes as input a stream containing XML and an XPath expression.
Applies the XPath expression to the XML and returns a generator
yielding the text contents of each element returned.

>>> parse_xml_etree(
...   StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
...   '//elem1').next()
'one'
>>> parse_xml_etree(
...   StringIO('<test><elem1>one</elem1><elem2>two</elem2></test>'),
...   '//elem2').next()
'two'
>>> parse_xml_etree(
...   StringIO('<test><null>&#0;</null><elem3>three</elem3></test>'),
...   '//elem3').next()
'three'
"""

  tree = ElementTree.parse(sin)
  for element in tree.findall(xpath):
    yield element.text  

if __name__ == '__main__':
  doctest.testmod(verbose=True)

The third test fails with the following exception:

ExpatError: reference to invalid character number: line 1, column 13

Is the &#0; entity illegal XML? Whether it is or not, the files I want to parse contain it, and I need some way to parse them. Can anyone suggest a parser other than Expat, or settings for Expat, that would let me do that?


Update: I just discovered BeautifulSoup, a tag soup parser of the kind mentioned in the answer comments below. For fun I went back to this problem and tried using it as an XML cleaner in front of ElementTree, but it dutifully converted the &#0; into a just-as-invalid null byte. :-)

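from BeautifulSoup import BeautifulStoneSoup  # BeautifulSoup 3.x

# Run the document through BeautifulStoneSoup first, then hand the result to ElementTree.
cleaned_s = StringIO(
  BeautifulStoneSoup('<test><null>&#0;</null><elem3>three</elem3></test>',
                     convertEntities=BeautifulStoneSoup.XML_ENTITIES
  ).renderContents()
)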
tree = ElementTree.parse(cleaned_s)

... yields

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 12

In my particular case, though, I didn't really need XPath as such; I could have used BeautifulSoup itself and its quite simple node addressing style, parsed_tree.test.elem1.contents[0].

Preconscious answered 14/6, 2010 at 16:9 Comment(0)
6

&#0; is not in the legal character range defined by the XML spec. Alas, my Python skills are pretty rudimentary, so I'm not much help there.

Slay answered 14/6, 2010 at 16:13 Comment(3)
Hm, yes, the specification makes it quite clear. Thank you for the exact reference. - Preconscious
I realize this is an old thread, but the spec says which literal characters may appear in XML. The sequence &#0; is not literally a null character, but a 4-character sequence that represents a null byte. Given that distinction, is &#0; legal? I can't find anything in the spec that says it is illegal. - Fewness
A valid question. But the answer is here: w3.org/TR/REC-xml/#sec-references says "Characters referred to using character references MUST match the production for Char." - Preconscious
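The rule quoted in the comments can be checked mechanically. The following is only a sketch, written for Python 3 rather than the 2.5 used in the question; the helper names is_xml_char and char_ref_is_legal are made up for illustration, and the ranges are those of the Char production in XML 1.0.

```python
def is_xml_char(code_point):
    """Return True if code_point matches the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def char_ref_is_legal(ref):
    """Check a numeric character reference such as '&#0;' or '&#x41;'."""
    body = ref[2:-1]  # drop the leading '&#' and the trailing ';'
    code_point = int(body[1:], 16) if body.startswith("x") else int(body)
    return is_xml_char(code_point)


print(char_ref_is_legal("&#0;"))    # False: NUL is outside every range
print(char_ref_is_legal("&#x41;"))  # True: 'A'
```

So &#0; is rejected not because a literal null byte appears in the file, but because the character it refers to is outside Char.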
4

&#0; is not a valid XML character. Ideally, you'd be able to get the creator of the file to change their process so that the file was not invalid like this.

If you must accept these files, you could pre-process them to turn &#0; into something else. For example, pick @ as an escape character, turn "@" into "@@", and "&#0;" into "@0".

Then as you get the text data from the parser, you can reverse the mapping. This is just an example, you can invent any escaping syntax you like.
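A minimal sketch of that round trip, in Python 3 and assuming the whole document fits in memory as a string. The function names escape_nul_refs and unescape_nul_refs are hypothetical. The plain string replacement knows nothing about CDATA sections or comments, and it only catches the exact spelling "&#0;", not variants such as "&#00;" or "&#x0;".

```python
import re
from xml.etree import ElementTree


def escape_nul_refs(xml_text):
    # Double the escape character first, so that an "@" already present
    # in the data cannot be confused with the "@0" marker added next.
    return xml_text.replace("@", "@@").replace("&#0;", "@0")


def unescape_nul_refs(text):
    # One left-to-right pass: "@@" becomes "@", "@0" becomes a NUL character.
    return re.sub(
        r"@([@0])",
        lambda m: "@" if m.group(1) == "@" else "\x00",
        text,
    )


raw = '<test><null>&#0;</null><elem3>three @ 3</elem3></test>'
root = ElementTree.fromstring(escape_nul_refs(raw))

# Reverse the mapping on the text of each child element.
texts = [unescape_nul_refs(el.text) for el in root]
print(texts)  # ['\x00', 'three @ 3']
```

Undoing the escape in a single regex pass matters: two separate replace calls would misread data such as "@@0", which stands for a literal "@0".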

Marilou answered 14/6, 2010 at 16:23 Comment(2)
In my particular case, I could just delete them. They are in an irrelevant element of the XML. It feels shaky to use text processing to handle XML, but since it's not well-formed I guess I have no choice... Using some sort of tag soup parser seems like overkill. - Preconscious
Are you sure that escaping algorithm is robust? Don't you have to consider the precedence of the features within XML's grammar? - Jaejaeger
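If the references can simply be deleted, as in the first comment, a regular expression pass before parsing may be enough. This is a sketch in Python 3; strip_nul_refs is a made-up name. It removes decimal and hexadecimal references to character 0 wherever they occur, including inside CDATA sections and comments, where the same characters would be literal text.

```python
import io
import re
from xml.etree import ElementTree

# Matches &#0;, &#00;, &#x0;, &#x00; and so on.
NUL_REF = re.compile(r"&#(?:0+|x0+);")


def strip_nul_refs(xml_text):
    """Delete every numeric character reference to code point 0."""
    return NUL_REF.sub("", xml_text)


raw = '<test><null>&#0;</null><elem3>three</elem3></test>'
tree = ElementTree.parse(io.StringIO(strip_nul_refs(raw)))
texts = [el.text for el in tree.findall('.//elem3')]
print(texts)  # ['three']
```'
assert strip_nul_refs('a&#x00;b&#00;c') == 'abc'
assert strip_nul_refs('&#10;') == '&#10;'
assert texts == ['three']
</test>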
