Set the implicit default encoding/decoding error handling in Python

2

I am working with external data that's encoded in latin1, so I added a sitecustomize.py containing:

sys.setdefaultencoding('latin_1')

Sure enough, working with latin1 strings now works fine.

But when I encounter a character that cannot be encoded in latin1:

s=str(u'abc\u2013')

I get UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)

What I would like is for unencodable characters to simply be replaced, i.e. in the example above I would get s == 'abc?', and to do that without explicitly calling decode() or encode() each time, i.e. not s.decode(..., 'replace') on each call.

I tried doing different things with codecs.register_error, but to no avail.

Please help?

Depth answered 29/7, 2010 at 14:9 Comment(3)

If you're doing s=str(u'abc\u2013'), you want to work in unicode, which seems weird if you set default encoding to latin-1 – Mary 29/7, 2010 at 14:16

This is a bad idea: tarekziade.wordpress.com/2008/01/08/…. If you're working with encoded strings, you should explicitly decode them. Otherwise you're likely to cover up nasty bugs. – Kershner 29/7, 2010 at 14:17

Specifically, you should work with Unicode inside your module. Decode the external data when it comes in, and encode it when it goes out again. – Kershner 29/7, 2010 at 14:21

2

There is a reason scripts can't call sys.setdefaultencoding: site.py deletes it from the sys module at startup. Don't change the default; some libraries (including standard libraries shipped with Python) expect it to be 'ascii'.

Instead, explicitly decode strings to Unicode when read into your program (via file, stdin, socket, etc.) and explicitly encode strings when writing them out.

Explicit decoding and encoding take an errors parameter (for example 'strict', 'ignore', or 'replace') specifying the behavior for bytes or characters that cannot be converted.
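As a minimal sketch of decoding on the way in and encoding on the way out, assuming Python 3 syntax and made-up sample bytes (the variable names are hypothetical, not from the question):

```python
# Hypothetical bytes as they might arrive from an external source.
raw_latin1 = b"caf\xe9"
raw_bad_utf8 = b"abc\xff"

# Decode at the boundary. latin-1 maps every byte value to a character,
# so decoding with it cannot raise.
text = raw_latin1.decode("latin-1")

# For codecs that can fail, errors="replace" substitutes U+FFFD
# for each undecodable byte instead of raising UnicodeDecodeError.
repaired = raw_bad_utf8.decode("utf-8", errors="replace")

# Encode on the way out. errors="replace" substitutes "?" for
# characters the target codec cannot represent.
encoded = "abc\u2013".encode("latin-1", errors="replace")

print(text, repaired, encoded)
```

The last call is the explicit form of what the question asks for: the en dash cannot be represented in latin-1, so it comes out as "?".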

Marinna answered 29/7, 2010 at 17:14 Comment(0)

1

You can define your own error handler, register it with codecs.register_error, and pass its name as the errors argument when encoding or decoding. See this example:

import codecs
from logging import getLogger

log = getLogger()

def custom_character_handler(exception):
    # Log what failed, then substitute "?" for the offending span.
    log.error("%s for %s on %s from position %s to %s. Using '?' in-place of it!",
              exception.reason,
              exception.object[exception.start:exception.end],
              exception.encoding,
              exception.start,
              exception.end)
    # Return the replacement text and the position at which to resume.
    return ("?", exception.end)

codecs.register_error("custom_character_handler", custom_character_handler)

print( b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'.decode('utf8', 'custom_character_handler') )
print( codecs.encode(u"abc\u03c0de", "ascii", "custom_character_handler") )

Running it, you will see:

invalid start byte for b'\xbb' on utf-8 from position 5 to 6. Using '?' in-place of it!
Føö?Bår
ordinal not in range(128) for π on ascii from position 3 to 4. Using '?' in-place of it!
b'abc?de'
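To avoid passing the handler name on every call, one option is to attach it once to a text stream. The following is only a sketch under assumptions: it uses Python 3, in-memory streams stand in for real files or sockets, and the handler name "question_mark" is made up for illustration.

```python
import codecs
import io

def question_mark_handler(exc):
    # Hypothetical handler: substitute "?" and resume after the bad span.
    if isinstance(exc, (UnicodeDecodeError, UnicodeEncodeError)):
        return ("?", exc.end)
    raise exc

codecs.register_error("question_mark", question_mark_handler)

# Reading: the wrapper applies the handler to every undecodable byte.
raw = io.BytesIO(b"abc\xbbdef\n")
reader = io.TextIOWrapper(raw, encoding="utf-8", errors="question_mark")
line = reader.readline()

# Writing: the wrapper applies the handler to every unencodable character.
sink = io.BytesIO()
writer = io.TextIOWrapper(sink, encoding="latin-1", errors="question_mark")
writer.write("abc\u2013")
writer.flush()
written = sink.getvalue()

print(repr(line), written)
```

The error policy is then chosen once, where the stream is created, and the rest of the program works with ordinary text.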

Make answered 27/4, 2019 at 2:16 Comment(0)
