Parsing huge, badly encoded XML files in Python
G

4

12

I have been working on code that parses external XML files. Some of these files are huge, up to gigabytes of data. They need to be parsed as a stream, because loading them into memory is far too inefficient and often leads to out-of-memory errors.

I have used the libraries minidom, ElementTree and cElementTree, and I am currently using lxml. Right now I have a working, fairly memory-efficient script that uses lxml.etree.iterparse. The problem is that some of the XML files I need to parse contain encoding errors (they declare UTF-8, but contain differently encoded characters). When using lxml.etree.parse this can be fixed by using the recover=True option of a custom parser, but iterparse does not accept a custom parser. (see also: this question)

My current code looks like this:

from lxml import etree

events = ("start", "end")
context = etree.iterparse(xmlfile, events=events)
event, root_element = next(context)  # first event: start of <items>
for action, element in context:
    if action == 'end' and element.tag == 'item':
        # <parse>
        root_element.clear()  # drop already-processed children to save memory

Error when iterparse encounters a bad character (in this case, it's a ^Y):

lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0x19 0x73 0x20 0x65, line 949490, column 25

I don't even need to decode this data; I can just drop it. However, I don't know of any way to skip the element. I tried context.next and continue inside try/except statements.

Any help would be appreciated!

Update

Some additional info: This is the line where iterparse fails:

<description><![CDATA:[musea de la photographie fonds mercator. Met meer dan 80.000 foto^Ys en 3 miljoen negatieven is het Muse de la...]]></description>

According to etree, the error occurs at bytes 0x19 0x73 0x20 0x65.
According to hexedit, 19 73 20 65 translates to ASCII .s e
The . in this place should be an apostrophe (foto's).

I also found this question, which does not provide a solution.

Gallagher answered 9/7, 2012 at 17:46 Comment(6)
Did you try Beautiful Soup?Psalms
Is it feasible to perform a pre-processing step to correct the encodings? You could probably even do this in a pipeline using a StringIO object and feeding output to etree.Bakken
@DanatheSane It certainly is, any tips on how I could go about this?Gallagher
@Gallagher If you put together some code to parse tags, attributes and content, you could feed the problematic input into chardet (see #436720) and rewrite the file as you go. I'm not sure where in the document the encoding problems are, but this shouldn't incur too much overhead if they are somewhat isolated.Bakken
Are you using Python 2 or 3? And, if 2, can you count on 2.5+, or do you need 2.4 compat? The reason I ask is that codecs.EncodedFile is probably the best solution for 2.5-2.7, but it has some problems in 2.4, and there may be even simpler answers in 3.x.Alanealanine
Please post a complete XML document that includes your top-level tag and DTD (if any) as well as the fragment, so other people can test the same thing you're testing. Also, if you can show a couple bytes before the error that might help (so we can see whether we've got half a UTF-8 character or something).Alanealanine
G
2

Since the problem is caused by illegal XML characters, in this case the 0x19 byte, I decided to strip them out. I found the following regular expression on this site:

import re

invalid_xml = re.compile(u'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]')

And I wrote this piece of code, which removes the illegal bytes while saving an XML feed:

import urllib2

conn = urllib2.urlopen(xmlfeed)
xmlfile = open('output', 'wb')  # binary mode: write the bytes unchanged

while True:
    data = conn.read(4096)
    if not data:
        break
    # The pattern matches single bytes, so a chunk boundary cannot split a match.
    newdata, count = invalid_xml.subn('', data)
    if count > 0:
        print 'Removed %s illegal characters from XML feed' % count
    xmlfile.write(newdata)

xmlfile.close()
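
If writing an intermediate file is not desirable, the same filter could probably be applied on the fly. The following is a minimal Python 3 sketch, not tested against the original feed. SanitizedReader and item_texts are hypothetical names, the sample bytes are made up, and the standard library's xml.etree.ElementTree.iterparse stands in for lxml.etree.iterparse, which also accepts file-like objects.

```python
import io
import re
import xml.etree.ElementTree as ET

# Same byte class as the regular expression above, written as a bytes pattern.
INVALID_XML_BYTES = re.compile(rb'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]')


class SanitizedReader:
    """File-like wrapper (hypothetical helper) that drops illegal bytes on read."""

    def __init__(self, raw):
        self._raw = raw  # any binary file-like object, e.g. an HTTP response

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            if not chunk:
                return b''  # real end of input
            cleaned = INVALID_XML_BYTES.sub(b'', chunk)
            if cleaned:
                # Never return b'' for a non-empty chunk: the parser would
                # mistake it for end of file.
                return cleaned


def item_texts(source):
    """Stream <item> elements from a binary source and collect their text."""
    texts = []
    for _event, element in ET.iterparse(SanitizedReader(source), events=("end",)):
        if element.tag == "item":
            texts.append(element.text)
            element.clear()  # release the element's content once handled
    return texts


# Made-up sample containing the 0x19 byte from the question.
sample = io.BytesIO(b"<items><item>foto\x19s</item><item>ok</item></items>")
print(item_texts(sample))
```

The wrapper only needs a read method, so it can sit between the network response and the parser without buffering the whole document.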
Gallagher answered 10/7, 2012 at 21:43 Comment(0)
A
10

If the problems are actual character encoding problems, rather than malformed XML, the easiest, and probably most efficient, solution is to deal with it at the file reading point. Like this:

import codecs
from lxml import etree

events = ("start", "end")
# EncodedFile wraps a file object (not a path), so xmlfile must already be open.
reader = codecs.EncodedFile(xmlfile, 'utf8', 'utf8', 'replace')
context = etree.iterparse(reader, events=events)

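This will cause bytes that cannot be decoded as UTF-8 to be replaced by the Unicode replacement character U+FFFD (the 'replace' handler only produces '?' when encoding). There are a few other options; see the documentation for the codecs module for more.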
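
As a rough check of what the 'replace' handler does and does not catch, the Python 3 sketch below runs a made-up byte string through codecs.EncodedFile. The sample bytes are an assumption for illustration: 0xE9 is a stray Latin-1 byte, and 0x19 is the control byte from the question. The stray byte is replaced, but 0x19 is valid UTF-8 on its own, so it passes through unchanged. That may be why the same error can persist with this approach.

```python
import codecs
import io

# Made-up sample: 0xE9 is a stray Latin-1 byte, 0x19 is the control byte
# from the question.
raw = io.BytesIO(b"caf\xe9 foto\x19s")

# Decode as UTF-8 and re-encode as UTF-8, replacing undecodable bytes.
reader = codecs.EncodedFile(raw, "utf-8", "utf-8", "replace")
data = reader.read()

# 0xE9 has become U+FFFD (EF BF BD in UTF-8); 0x19 is still present.
print(repr(data))
```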

Alanealanine answered 9/7, 2012 at 18:1 Comment(2)
Hm, this looks like a great solution, but I've just tried it - same error at the same point, even when I change 'replace' to 'ignore'. (To answer your question above, this is Python 2.7, no compat required.)Gallagher
Can you post the XML file (or, better, a small document displaying the problem) somewhere, so people can help debug it?Alanealanine
C
1

I used a similar piece of code:

illegalxml = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')

...

illegalxml.sub("?",mystring)

...

However, this did not work for all possible strings (400+MB string).

For a final solution I used decoding/encoding as follows:

outxml = "C:/path_to/xml_output_file.xml"
with open(outxml, "w") as out:
    valid_xmlstring = mystring.encode('latin1','xmlcharrefreplace').decode('utf8','xmlcharrefreplace')
    out.write(valid_xmlstring) 
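
For input too large to hold comfortably as one string, a chunked variant may be easier on memory. This Python 3 sketch assumes the input is meant to be UTF-8. clean_stream and the chunk size are hypothetical, and undecodable bytes become U+FFFD rather than character references, so it is a different technique from the encode/decode round trip above.

```python
import codecs
import io
import re

# Same character class as the regular expression above.
ILLEGAL_XML_CHARS = re.compile(
    '[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]'
)


def clean_stream(src, dst, chunk_size=65536):
    """Copy binary src to text dst, replacing illegal XML characters with '?'."""
    # The incremental decoder buffers a multi-byte character that is split
    # across two chunks instead of treating it as an error.
    decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
    while True:
        chunk = src.read(chunk_size)
        text = decoder.decode(chunk, final=not chunk)
        dst.write(ILLEGAL_XML_CHARS.sub('?', text))
        if not chunk:
            break


# Made-up sample; chunk_size=4 deliberately splits the two-byte character.
src = io.BytesIO("caf\u00e9 foto\x19s".encode('utf-8'))
dst = io.StringIO()
clean_stream(src, dst, chunk_size=4)
print(repr(dst.getvalue()))
```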
Caracalla answered 1/9, 2013 at 0:37 Comment(0)
H
0

I had a similar problem with the char "" in my XML file, which is also an invalid XML character. In XML version 1.0, character references such as &#x0; and &#xE; are not allowed. As a rule, I treated every reference matching the regular expression '&#x[0-1]?[0-9A-E];' as not allowed. My goal was to remove the invalid characters from a huge XML file, so I adapted Rik's answer as follows:

import re

invalid_xml = re.compile(r'&#x[0-1]?[0-9a-eA-E];')

# Stream line by line so the whole file never has to fit in memory.
with open('old_file.xml') as f, open('new_file.xml', 'w') as new_file:
    for line in f:
        nline, count = invalid_xml.subn('', line)
        new_file.write(nline)
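
Note that the pattern above also matches &#x9;, &#xA; and &#xD;, which XML 1.0 allows, and it misses references such as &#xF; and &#x1F;. A narrower pattern might look like the sketch below (Python 3). ILLEGAL_CHAR_REF and strip_illegal_char_refs are hypothetical names, and decimal references such as &#25; are not handled.

```python
import re

# Hex character references to code points that XML 1.0 forbids:
# 0x0-0x8, 0xB, 0xC and 0xE-0x1F. Tab (0x9), LF (0xA) and CR (0xD)
# are legal and are left in place. Leading zeros are tolerated.
ILLEGAL_CHAR_REF = re.compile(r'&#x0*(?:[0-8BbCcEeFf]|1[0-9A-Fa-f]);')


def strip_illegal_char_refs(line):
    """Remove illegal hex character references from one line of text."""
    return ILLEGAL_CHAR_REF.sub('', line)


# Made-up sample line mixing legal and illegal references.
print(strip_illegal_char_refs('a&#x19;b&#xA;c&#x0;d&#x1F;e&#x20;'))
```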
Huberty answered 8/1, 2016 at 10:30 Comment(0)
