ParseError: not well-formed (invalid token) using cElementTree

N

15

37

I receive XML strings from an external source that can contain unsanitized, user-contributed content.

The following xml string gave a ParseError in cElementTree:

>>> print repr(s)
'<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>'
>>> import xml.etree.cElementTree as ET
>>> ET.XML(s)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    ET.XML(s)
  File "<string>", line 106, in XML
ParseError: not well-formed (invalid token): line 1, column 17

Is there a way to make cElementTree not complain?

Newhall answered 24/10, 2012 at 9:18 Comment(0)

P

33

It seems to complain about \x08. You will need to escape that.

Edit:

Or you can have the parser ignore the errors by using lxml's recover option:

from lxml import etree
parser = etree.XMLParser(recover=True)
etree.fromstring(xmlstring, parser=parser)

Pinfold answered 24/10, 2012 at 9:25 Comment(3)

I don't want to change anything about the content of the XML I receive, I just need to transform it into a cElementTree Element. – Newhall 24/10, 2012 at 9:30

escaping is not the same as changing btw. – Pinfold 24/10, 2012 at 9:39

The recover option is no longer available for ElementTree's XMLParser, right? Or what's 'lxml'? Is it not vanilla Python? – Gitagitel 4/4, 2017 at 13:19

P

29

I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)

import xml.etree.ElementTree as ET
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(xmlstring, parser=parser)

EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...
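
For what it is worth, the Python 3 documentation lists the signature as fromstring(text, parser=None), so the call above is accepted there. Below is a minimal sketch of the idea, assuming the payload is ISO-8859-1 bytes with no encoding declaration; the sample data and the helper name parse_with_encoding are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: Latin-1 bytes with no encoding declaration.
raw = "<Comment>café</Comment>".encode("iso-8859-1")


def parse_with_encoding(data, encoding):
    """Parse XML bytes, overriding the encoding the parser assumes."""
    parser = ET.XMLParser(encoding=encoding)
    return ET.fromstring(data, parser=parser)


# Without an override the bytes are read as UTF-8, where a lone 0xE9
# byte is not a valid sequence, so the parse fails.
try:
    ET.fromstring(raw)
    default_ok = True
except ET.ParseError:
    default_ok = False

root = parse_with_encoding(raw, "iso-8859-1")
```

Note that this only helps when the bytes really are in the encoding you name; it does nothing for control characters such as \x08.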

Pointless answered 25/11, 2013 at 22:24 Comment(4)

I don't believe that is correct, fromstring doesn't take any arguments apart from text (it does not accept parser). Perhaps you meant XML instead of fromstring? – Severe 16/6, 2015 at 15:12

The parse function has a parser parameter, so you can give it a file name as input instead of a string: e = ElementTree.parse(my_file, parser=ElementTree.XMLParser(encoding='iso-8859-5') ) – Tuberculous 1/3, 2017 at 21:16

As mentioned by the first comment, fromstring doesn't accept the parser argument. This answer is wrong in syntax. – Novobiocin 17/6, 2017 at 4:6

The docs suggest ET.fromstringlist([xmlstring], parser=parser) could be used to achieve what is intended here. – Photoluminescence 15/10, 2020 at 15:31

L

9

This code snippet worked for me. I had an issue parsing a batch of XML files, and I had to have the parser read them as 'iso-8859-5':

import xml.etree.ElementTree as ET

tree = ET.parse(filename, parser = ET.XMLParser(encoding = 'iso-8859-5'))

Lepanto answered 25/2, 2020 at 19:24 Comment(1)

This worked for me, too, I wonder, why. – Summerville 21/12, 2022 at 9:52

A

7

See this answer to another question and the corresponding part of the XML spec.

The backspace U+0008 is an invalid character in XML 1.0 documents. It cannot occur literally, and the character reference &#8; is only permitted in XML 1.1.

If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.
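
A minimal sketch of that replacement step, applied to the string from the question. It assumes that dropping the offending characters is acceptable (this does alter the content), and the names _INVALID_XML_CHARS and strip_invalid_xml_chars are made up for illustration. The pattern keeps only what the XML 1.0 Char production allows.

```python
import re
import xml.etree.ElementTree as ET

# Matches every character outside the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF are kept.
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


# The string from the question, with six literal backspaces.
s = "<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>"
element = ET.fromstring(strip_invalid_xml_chars(s))
```

If the removed characters matter, substituting a visible placeholder instead of the empty string is an equally small change.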

Antoineantoinetta answered 24/10, 2012 at 9:35 Comment(0)

C

7

None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree as follows:

from bs4 import BeautifulSoup

with open("data/myfile.xml") as fp:
    soup = BeautifulSoup(fp, 'xml')

Then you can search the tree as:

soup.find_all('mytag')

Cleveland answered 8/5, 2018 at 10:56 Comment(2)

There is no 'xml parser from BeautifulSoup'. When you provide the xml parameter to BeautifulSoup, it uses lxml's XML parser under the hood. – Newhall 8/5, 2018 at 11:7

@Newhall thanks, yes, I meant that you need to install lxml before you can use BeautifulSoup this way. At least in my case I had to install it separately... – Cleveland 8/5, 2018 at 11:11

T

5

After lots of searching, all I found out was that you have to deal with certain characters that are illegal in XML before your parser will work. Here is what worked for me (note that it removes the characters rather than escaping them):

import re
escape_illegal_xml_characters = lambda x: re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', '', x)

And use it like you'd normally do:

ET.XML(escape_illegal_xml_characters(my_xml_string)) #instead of ET.XML(my_xml_string)

Terzetto answered 13/12, 2019 at 9:57 Comment(0)

G

4

This is most probably an encoding error. For example, I had an XML file encoded in UTF-8-BOM (checked from the Notepad++ Encoding menu) and got a similar error message.

The workaround (Python 3.6)

import io
from xml.etree import ElementTree as ET

with io.open(file, 'r', encoding='utf-8-sig') as f:
    contents = f.read()
    tree = ET.fromstring(contents)

Check the encoding of your xml file. If it is using different encoding, change the 'utf-8-sig' accordingly.
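
To see what 'utf-8-sig' changes, here is a small in-memory sketch. The bytes below are a made-up stand-in for a file saved as UTF-8 with BOM, not data from the original problem.

```python
import codecs
import xml.etree.ElementTree as ET

# Hypothetical file contents: a UTF-8 BOM followed by the document.
raw = codecs.BOM_UTF8 + "<Comment>ok</Comment>".encode("utf-8")

plain = raw.decode("utf-8")      # the BOM survives as a leading U+FEFF
clean = raw.decode("utf-8-sig")  # the BOM is dropped while decoding

root = ET.fromstring(clean)
```

Decoding with 'utf-8-sig' also works for files that have no BOM, so it is a reasonable default when the files come from mixed sources.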

Groundage answered 13/2, 2018 at 14:29 Comment(0)

P

3

Here is a gotcha that caught me when using Python's ElementTree. This raises the invalid token error:

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

xml = u"""<?xml version='1.0' encoding='utf8'?>
<osm generator="pycrocosm server" version="0.6"><changeset created_at="2017-09-06T19:26:50.302136+00:00" id="273" max_lat="0.0" max_lon="0.0" min_lat="0.0" min_lon="0.0" open="true" uid="345" user="john"><tag k="test" v="Съешь же ещё этих мягких французских булок да выпей чаю" /><tag k="foo" v="bar" /><discussion><comment data="2015-01-01T18:56:48Z" uid="1841" user="metaodi"><text>Did you verify those street names?</text></comment></discussion></changeset></osm>"""

xmltest = ET.fromstring(xml.encode("utf-8"))

However, it works with the addition of a hyphen in the encoding type:

<?xml version='1.0' encoding='utf-8'?>

Most odd. Someone found this footnote in the python docs:

The encoding string included in XML output should conform to the appropriate standards. For example, “UTF-8” is valid, but “UTF8” is not.

Pawl answered 6/9, 2017 at 19:35 Comment(0)

L

2

In my case I got the same error. (using Element Tree)

I had to add these lines:

    import xml.etree.ElementTree as ET
    from lxml import etree

    parser = etree.XMLParser(recover=True,encoding='utf-8')
    xml_file = ET.parse(path_xml,parser=parser)

Works in Python 3.10.2

Loiretcher answered 9/8, 2022 at 17:43 Comment(1)

this works great, with only minimal code changes coming from xml.etree – Dorinda 28/8, 2023 at 15:40

U

1

I was stuck on a similar problem and finally figured out the root cause in my particular case. If you read the data from multiple XML files that sit in the same folder, you will also try to parse the .DS_Store file. Add this condition before parsing:

for file in files:
    if file.endswith('.xml'):
       run_your_code...

This trick helped me as well
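
A self-contained sketch of the same filter, assuming all files sit directly in one folder. The helper name parse_xml_files and the throwaway demo folder are made up for illustration.

```python
import tempfile
from pathlib import Path
import xml.etree.ElementTree as ET


def parse_xml_files(folder):
    """Parse every *.xml file in folder and skip everything else."""
    trees = {}
    for path in sorted(Path(folder).iterdir()):
        # Hidden files such as .DS_Store have no ".xml" suffix and are skipped.
        if path.is_file() and path.suffix.lower() == ".xml":
            trees[path.name] = ET.parse(str(path))
    return trees


# Demo with a temporary folder holding one XML file and a fake .DS_Store.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.xml").write_text("<Comment>ok</Comment>", encoding="utf-8")
    Path(tmp, ".DS_Store").write_bytes(b"\x00\x01Bud1")
    parsed = parse_xml_files(tmp)

names = sorted(parsed)
```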

Undesirable answered 23/6, 2017 at 19:38 Comment(0)

B

1

lxml solved the issue, in my case

from lxml import etree

for _, ele in etree.iterparse(xml_file, tag='tag_i_wanted', encoding='utf-8'):
    print(ele.tag, ele.text)

in another case,

parser = etree.XMLParser(recover=True)
tree = etree.parse(xml_file, parser=parser)
tags_needed = tree.iter('TAG NAME')

Thanks to theeastcoastwest

Python 2.7

Bellwort answered 24/10, 2019 at 5:49 Comment(0)

E

0

What helped me with that error was Juan's answer: https://mcmap.net/q/413582/-parseerror-not-well-formed-invalid-token-using-celementtree But it wasn't enough. After struggling, I found out that the XML file needs to be saved with UTF-8 without BOM encoding.

The solution wasn't working for "normal" UTF-8.

Endeavor answered 5/2, 2016 at 10:20 Comment(2)

And what argument do you pass for that? – Tarnation 28/12, 2016 at 1:34

oh, it was such a long time ago. don't remember actually, but probably I just saved a file properly in notepad++ – Endeavor 29/12, 2016 at 8:5

B

0

The only thing that worked for me was adding mode and encoding when opening the file, like below:

with open(filenames[0], mode='r',encoding='utf-8') as f:
     readFile()

Otherwise it failed every time with the invalid token error if I simply did this:

 f = open(filenames[0], 'r')
 readFile()

Buchholz answered 29/8, 2019 at 18:28 Comment(0)

S

0

This error occurs when you pass a link (URL) to the parser directly. You first have to fetch the content behind that link and then parse that string:

response = requests.get(Link)
root = cElementTree.fromstring(response.content)

Slenderize answered 26/7, 2022 at 9:57 Comment(1)

As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Benefactor 27/7, 2022 at 11:12

H

-1

I tried the other solutions in the answers here but had no luck. Since I only needed to extract the value from a single XML node, I gave in and wrote my own function to do so:

import re

def ParseXmlTagContents(source, tag, tagContentsRegex):
    # Find <tag>contents</tag> and return only the contents ("" if not found).
    openTagString = "<"+tag+">"
    closeTagString = "</"+tag+">"
    found = re.search(openTagString + tagContentsRegex + closeTagString, source)
    if found:
        start, end = found.span()
        return source[start+len(openTagString):end-len(closeTagString)]
    return ""

Example usage would be:

<?xml version="1.0" encoding="utf-16"?>
<parentNode>
    <childNode>123</childNode>
</parentNode>

ParseXmlTagContents(xmlString, "childNode", "[0-9]+")

Hanna answered 6/9, 2018 at 13:36 Comment(0)
