Setting the encoding for sax parser in Python

When I feed a UTF-8 encoded XML file to an ExpatParser instance:

import codecs
import xml.sax

def test(filename):
    parser = xml.sax.make_parser()
    with codecs.open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            parser.feed(line)

...I get the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "test.py", line 72, in search_test
    parser.feed(line)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 29: ordinal not in range(128)

I'm probably missing something obvious here. How do I change the parser's encoding from 'ascii' to 'utf-8'?

Roulers answered 13/5, 2009 at 12:9 Comment(0)

Your code fails in Python 2.6, but works in 3.0.

This does work in 2.6, presumably because it allows the parser itself to figure out the encoding (perhaps by reading the encoding optionally specified on the first line of the XML file, and otherwise defaulting to utf-8):

import xml.sax

def test(filename):
    parser = xml.sax.make_parser()
    with open(filename, 'rb') as f:  # raw bytes: the parser detects the encoding itself
        parser.parse(f)
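
The same idea appears to carry over to the incremental feed() style from the question: hand the parser undecoded bytes and let it work out the encoding. Below is a minimal sketch, assuming Python 3. TextCollector and feed_bytes are hypothetical names used only for illustration, and the document is built in memory rather than read from a file.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect them all.
        self.chunks.append(content)


def feed_bytes(stream, chunk_size=16):
    """Feed undecoded bytes from a binary stream to the parser in small chunks."""
    handler = TextCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # bytes, not pre-decoded text
    parser.close()          # signals the end of the document
    return "".join(handler.chunks)


doc = '<?xml version="1.0" encoding="utf-8"?><name>Champs-\u00c9lys\u00e9es</name>'
text = feed_bytes(io.BytesIO(doc.encode("utf-8")))
```

With a real file, opening it with open(filename, 'rb') and passing the file object to feed_bytes should behave the same way.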
Pronouncement answered 13/5, 2009 at 12:22 Comment(0)

The SAX parser in Python 2.6 should be able to parse utf-8 without mangling it. Although you've left out the ContentHandler you're using with the parser, if that content handler attempts to print any non-ascii characters to your console, that will cause a crash.

For example, say I have this XML doc:

<?xml version="1.0" encoding="utf-8"?>
<test>
   <name>Champs-Élysées</name>
</test>

And this parsing apparatus:

import xml.sax

class MyHandler(xml.sax.handler.ContentHandler):

    def startElement(self, name, attrs):
        print "StartElement: %s" % name

    def endElement(self, name):
        print "EndElement: %s" % name

    def characters(self, ch):
        #print "Characters: '%s'" % ch
        pass

parser = xml.sax.make_parser()
parser.setContentHandler(MyHandler())

for line in open('text.xml', 'r'):
    parser.feed(line)
parser.close()  # tell the parser the document is complete

This will parse just fine, and the content will indeed preserve the accented characters in the XML. The only issue is the line in characters() that I've commented out. Running in the console in Python 2.6, this will produce the exception you're seeing because the print statement must convert the characters to ASCII for output.

You have 3 possible solutions:

One: Make sure your terminal supports unicode, then create a sitecustomize.py entry in your site-packages and set the default character set to utf-8:

import sys
sys.setdefaultencoding('utf-8')

Two: Don't print the output to the terminal (tongue-in-cheek)

Three: Normalize the output using unicodedata.normalize to convert non-ascii chars to ascii equivalents, or encode the chars to ascii for text output: ch.encode('ascii', 'replace'). Of course, using this method you won't be able to properly evaluate the text.
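
Option three can be sketched roughly as follows. This assumes Python 3 string semantics, and to_ascii is a hypothetical helper name, not part of any library. Note that unicodedata.normalize only decomposes accented letters; the non-ASCII combining marks still have to be dropped by the encode step.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only approximation of text (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Ignoring non-ASCII code points then drops the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


name = "Champs-\u00c9lys\u00e9es"
stripped = to_ascii(name)                                   # "Champs-Elysees"
replaced = name.encode("ascii", "replace").decode("ascii")  # "Champs-?lys?es"
```

Either way the output is lossy, which is why this only suits display and not further processing of the text.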

Using option one above, your code worked just fine for me in Python 2.5.

Buatti answered 13/5, 2009 at 13:18 Comment(1)
The actual problem in the original question has nothing to do with printing unicode to the terminal. It's due to the fact that the OP was pre-decoding the input with codecs.open, as Stephan202 has identified. - Natty

Jarret Hardie already explained the issue. But for those of you who are coding for the command line and don't seem to have sys.setdefaultencoding visible, the quick workaround for this bug (or "feature") is:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Hopefully reload(sys) won't break anything else.

More details in this old blog:

The Illusive setdefaultencoding

Axletree answered 4/12, 2009 at 18:3 Comment(0)

To set an arbitrary file encoding for a SAX parser, one can use InputSource as follows:

import xml.sax

def test(filename, encoding):
    parser = xml.sax.make_parser()
    with open(filename, "rb") as f:
        input_source = xml.sax.xmlreader.InputSource()
        input_source.setByteStream(f)
        input_source.setEncoding(encoding)
        parser.parse(input_source)

This allows parsing an XML file that has a non-ASCII, non-UTF8 encoding. For example, one can parse an extended ASCII file encoded with LATIN1 like: test(filename, "latin1")
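
To see the effect without a file on disk, here is a minimal sketch assuming Python 3. TextCollector and parse_bytes are hypothetical names for illustration. The bytes are Latin-1 encoded and carry no XML declaration, so the encoding set on the InputSource is what tells the parser how to decode them.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import InputSource


class TextCollector(ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_bytes(data, encoding):
    """Parse raw XML bytes, overriding the encoding via InputSource."""
    handler = TextCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    source = InputSource()
    source.setByteStream(io.BytesIO(data))
    source.setEncoding(encoding)
    parser.parse(source)
    return "".join(handler.chunks)


# Latin-1 bytes with no declaration; these are not valid UTF-8.
data = "<name>Champs-\u00c9lys\u00e9es</name>".encode("latin-1")
text = parse_bytes(data, "iso-8859-1")
```

The sketch spells the encoding as "iso-8859-1", the standard name for Latin-1.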

(Added this answer to directly address the title of this question, as it tends to rank highly in search engines.)

Beltane answered 8/11, 2015 at 19:24 Comment(0)

Commenting on janpf's answer (sorry, I don't have enough reputation to put it there): note that janpf's version will break IDLE, which requires its own stdout etc. that differ from sys's defaults. So I'd suggest modifying the code to be something like:

import sys

currentStdOut = sys.stdout
currentStdIn = sys.stdin
currentStdErr = sys.stderr

reload(sys)
sys.setdefaultencoding('utf-8')

sys.stdout = currentStdOut
sys.stdin = currentStdIn
sys.stderr = currentStdErr

There may be other variables to preserve, but these seem like the most important.

Shanks answered 20/8, 2012 at 22:27 Comment(0)
