Parsing huge, badly encoded XML files in Python

Asked 9/7, 2012 at 17:46 Answered 8/1, 2016 at 10:30

I have been working on code that parses external XML-files. Some of these files are huge, up to gigabytes of data. Needless to say, these files need to be parsed as a stream because loading them into memory is much too inefficient and often leads to OutOfMemory troubles.

I have used the libraries minidom, ElementTree and cElementTree, and I am currently using lxml. Right now I have a working, fairly memory-efficient script that uses lxml.etree.iterparse. The problem is that some of the XML files I need to parse contain encoding errors (they advertise as UTF-8, but contain differently encoded characters). When using lxml.etree.parse this can be fixed by using the recover=True option of a custom parser, but iterparse does not accept a custom parser. (see also: this question)

My current code looks like this:

from lxml import etree

events = ("start", "end")
context = etree.iterparse(xmlfile, events=events)
event, root_element = next(context)  # <items>
for action, element in context:
    if action == 'end' and element.tag == 'item':
        # <parse>
        root_element.clear()  # free the items that have already been handled

Error when iterparse encounters a bad character (in this case, it's a ^Y):

lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0x19 0x73 0x20 0x65, line 949490, column 25

I don't even wish to decode this data, I can just drop it. However I don't know any way to skip the element - I tried context.next and continue in try/except statements.

Any help would be appreciated!

Update

Some additional info: This is the line where iterparse fails:

<description><![CDATA:[musea de la photographie fonds mercator. Met meer dan 80.000 foto^Ys en 3 miljoen negatieven is het Muse de la...]]></description>

According to etree, the error occurs at bytes 0x19 0x73 0x20 0x65.
According to hexedit, 19 73 20 65 translates to ASCII .s e
The . in this place should be an apostrophe (foto's).

I also found this question, which does not provide a solution.

Gallagher answered 9/7, 2012 at 17:46 Comment(6)

Did you try Beautiful Soup? – Psalms 9/7, 2012 at 17:50

Is it feasible to perform a pre-processing step to correct the encodings? You could probably even do this in a pipeline using a StringIO object and feeding output to etree. – Bakken 9/7, 2012 at 17:56

@DanatheSane It certainly is, any tips on how I could go about this? – Gallagher 9/7, 2012 at 17:58

@Gallagher If you put together some code to parse tags, attributes and content, you could feed the problematic input into chardet (see #436720) and rewrite the file as you go. I'm not sure where in the document the encoding problems are, but this shouldn't incur too much overhead if they are somewhat isolated. – Bakken 9/7, 2012 at 18:2

Are you using Python 2 or 3? And, if 2, can you count on 2.5+, or do you need 2.4 compat? The reason I ask is that codecs.EncodedFile is probably the best solution for 2.5-2.7, but it has some problems in 2.4, and there may be even simpler answers in 3.x. – Alanealanine 9/7, 2012 at 18:2

Please post a complete XML document that includes your top-level tag and DTD (if any) as well as the fragment, so other people can test the same thing you're testing. Also, if you can show a couple bytes before the error that might help (so we can see whether we've got half a UTF-8 character or something). – Alanealanine 9/7, 2012 at 20:22

Since the problem is caused by illegal XML characters, in this case the 0x19 byte, I decided to strip them out. I found the following regular expression on this site:

import re

# ASCII control characters that XML 1.0 forbids, plus DEL (0x7F),
# which XML 1.0 permits but discourages
invalid_xml = re.compile(u'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]')

And I wrote this piece of code that removes the illegal bytes while saving an XML feed:

import urllib2

conn = urllib2.urlopen(xmlfeed)
xmlfile = open('output', 'wb')  # binary mode, so the bytes are written unchanged

while True:
    data = conn.read(4096)
    if not data:
        break
    # The pattern matches single bytes, so a chunk boundary cannot split a match
    newdata, count = invalid_xml.subn('', data)
    if count > 0:
        print 'Removed %s illegal characters from XML feed' % count
    xmlfile.write(newdata)

xmlfile.close()
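
If the intermediate file is not wanted, the same byte filter can probably be applied on the fly by wrapping the input in a small file-like object and handing that to the streaming parser. The following is only a sketch under assumptions: it is written for Python 3, it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in for lxml, and CleaningReader is a hypothetical helper name that does not come from any library.

```python
import io
import re
import xml.etree.ElementTree as ET

# Control bytes that XML 1.0 forbids (tab, LF and CR are kept).
_ILLEGAL_BYTES = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f]')


class CleaningReader:
    """Hypothetical wrapper: a file-like object that drops illegal bytes."""

    def __init__(self, raw):
        self._raw = raw  # any binary file-like object with read()

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            if not chunk:
                return b''  # real end of input
            cleaned = _ILLEGAL_BYTES.sub(b'', chunk)
            if cleaned:
                return cleaned
            # The chunk held only illegal bytes: read on, because an
            # empty result would be mistaken for end of file.


# Small in-memory stand-in for the feed, with a 0x19 byte inside an item.
doc = b'<items><item>foto\x19s</item><item>ok</item></items>'
texts = [
    element.text
    for _event, element in ET.iterparse(CleaningReader(io.BytesIO(doc)))
    if element.tag == 'item'
]
```

Because every illegal character here is a single byte, filtering chunk by chunk is safe; a filter that had to match multi-byte sequences would need to carry leftover bytes from one read to the next.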

Gallagher answered 10/7, 2012 at 21:43 Comment(0)

If the problems are actual character encoding problems, rather than malformed XML, the easiest, and probably most efficient, solution is to deal with them at the point where the file is read. Like this:

import codecs
from lxml import etree
events = ("start", "end")
# xmlfile must be an open file object here, not a path
reader = codecs.EncodedFile(xmlfile, 'utf8', 'utf8', 'replace')
context = etree.iterparse(reader, events=events)

This will cause bytes that cannot be decoded as UTF-8 to be replaced by the Unicode replacement character (U+FFFD). There are a few other options; see the documentation for the codecs module for more.
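
One caveat, which may explain the result reported in the comments below: a codec error handler only fires on bytes that are not valid UTF-8. The 0x19 byte from the question is valid UTF-8 (it decodes to the control character U+0019), so it passes through untouched and is then rejected by the XML parser. The sketch below tries to show this in Python 3, using the standard library's xml.etree.ElementTree instead of lxml; the sample bytes are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample: \xe9 is a stray Latin-1 byte, \x19 is the control
# character from the question.
raw = b'<d>caf\xe9 foto\x19s</d>'

text = raw.decode('utf-8', errors='replace')

# The stray byte is not valid UTF-8, so the handler replaces it ...
bad_byte_replaced = '\ufffd' in text
# ... but 0x19 decodes without error, so the handler never sees it.
control_char_survives = '\x19' in text

try:
    ET.fromstring(text)
    parsed = True
except ET.ParseError:
    # The parser still refuses the document because of U+0019.
    parsed = False
```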

Alanealanine answered 9/7, 2012 at 18:1 Comment(2)

Hm, this looks like a great solution, but I've just tried it - same error at the same point, even when I change 'replace' to 'ignore'. (To answer your question above, this is Python 2.7, no compat required.) – Gallagher 9/7, 2012 at 18:16

Can you post the XML file (or, better, a small document displaying the problem) somewhere, so people can help debug it? – Alanealanine 9/7, 2012 at 18:42

I used a similar piece of code:

import re

illegalxml = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')

...

illegalxml.sub("?",mystring)

...

However, this did not work for all possible strings (400+MB string).

For a final solution I used decoding/encoding as follows:

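outxml = "C:/path_to/xml_output_file.xml"
with open(outxml, "w") as out:
    # Note: 'xmlcharrefreplace' is only implemented for encoding; if the decode
    # step meets an undecodable byte, it raises TypeError instead of substituting.
    valid_xmlstring = mystring.encode('latin1','xmlcharrefreplace').decode('utf8','xmlcharrefreplace')
    out.write(valid_xmlstring)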
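
A character filter like the one at the top of this answer can also be written as the complement of what XML 1.0 allows, which may be easier to check against the specification's Char production. This is a Python 3 sketch rather than the code used above; sanitize_xml_text is a hypothetical helper name, and the '?' replacement simply mirrors the earlier snippet.

```python
import re

# Matches anything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    '[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def sanitize_xml_text(text, replacement='?'):
    """Replace every character that XML 1.0 does not allow."""
    return _NOT_XML_CHAR.sub(replacement, text)


cleaned = sanitize_xml_text('foto\x19s \ufffe ok\n')
```

For a string of several hundred megabytes it would probably be kinder to memory to run this over chunks or lines of the input instead of one giant string.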

Caracalla answered 1/9, 2013 at 0:37 Comment(0)

I had a similar problem with the character "" in my XML file, which is also an invalid XML character. XML 1.0 does not allow character references such as &#x0; or &#xE;: every control character below 0x20 except tab (&#x9;), line feed (&#xA;) and carriage return (&#xD;) is forbidden, even when written as a numeric reference. My goal was to remove the invalid references from a huge XML file, so I adapted Rik's answer as follows:

import re

# Hex character references for 0x0-0x8, 0xB, 0xC and 0xE-0x1F;
# &#x9;, &#xA; and &#xD; (tab, LF, CR) are legal and are left alone.
invalid_xml = re.compile(r'&#x(?:0?[0-8BbCcEeFf]|1[0-9A-Fa-f]);')

with open('old_file.xml') as f, open('new_file.xml', 'w') as new_file:
    for line in f:
        nline, count = invalid_xml.subn('', line)
        new_file.write(nline)
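
A fixed pattern like this only covers hexadecimal references written without extra leading zeros. A more general variant, sketched here for Python 3, decodes the number in each reference and checks it against the XML 1.0 character ranges, so decimal forms such as &#25; are handled as well. The names is_xml10_char and strip_illegal_char_refs are hypothetical helpers, not part of any library.

```python
import re

# Numeric character references, hexadecimal (&#x19;) or decimal (&#25;).
_CHAR_REF = re.compile(r'&#(?:x([0-9A-Fa-f]+)|([0-9]+));')


def is_xml10_char(codepoint):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        codepoint in (0x9, 0xA, 0xD)
        or 0x20 <= codepoint <= 0xD7FF
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


def strip_illegal_char_refs(text):
    """Drop references to illegal characters; keep all other references."""
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
        return match.group(0) if is_xml10_char(codepoint) else ''
    return _CHAR_REF.sub(_replace, text)


result = strip_illegal_char_refs('a&#x19;b&#xA;c&#25;d&#x41;e&#x0;')
```

It can be applied line by line in the same loop as above, since a character reference never spans a line break.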

Huberty answered 8/1, 2016 at 10:30 Comment(0)
