Set the implicit default encoding/decoding error handling in Python

I am working with external data that's encoded in latin1, so I added a sitecustomize.py containing:

sys.setdefaultencoding('latin_1') 

Sure enough, working with latin1 strings now works fine.

But when I encounter something that cannot be encoded in latin1:

s=str(u'abc\u2013')

I get UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)

What I would like is for the unencodable characters to simply be replaced, i.e. in the above example I would get s == 'abc?', and to do that without explicitly calling decode() or encode() each time, i.e. not s.encode(..., 'replace') or s.decode(..., 'replace') on each call.

I tried different things with codecs.register_error, but to no avail.

Please help?

Depth answered 29/7, 2010 at 14:9 Comment(3)
If you're doing s=str(u'abc\u2013'), you want to work in unicode, which seems weird if you set default encoding to latin-1. - Mary
This is a bad idea: tarekziade.wordpress.com/2008/01/08/…. If you're working with encoded strings, you should explicitly decode them. Otherwise you're likely to cover up nasty bugs. - Kershner
Specifically, you should work with Unicode inside your module. Decode the external data when it comes in, and encode it when it goes out again. - Kershner

There is a reason scripts can't call sys.setdefaultencoding: Python removes it from the sys module at startup. Don't work around that; some libraries (including standard libraries shipped with Python) expect the default to be 'ascii'.

Instead, explicitly decode strings to Unicode when read into your program (via file, stdin, socket, etc.) and explicitly encode strings when writing them out.

Explicit decoding and encoding both take an errors parameter (for example 'replace' or 'ignore') specifying what to do with bytes or characters that cannot be converted.
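
As a minimal sketch of that approach (Python 3 syntax; the byte values and variable names are illustrative assumptions, not taken from the question's data), decode once at the input boundary and choose an error policy when encoding on the way out:

```python
# Decode external latin-1 bytes into text once, at the input boundary.
raw = b"caf\xe9"                      # illustrative latin-1 input
text = raw.decode("latin-1")

# Encode at the output boundary; the second argument picks the error policy.
replaced = u"abc\u2013".encode("latin-1", "replace")   # b'abc?'
ignored = u"abc\u2013".encode("latin-1", "ignore")     # b'abc'

# The same parameter exists for decoding; 'replace' yields U+FFFD there.
decoded = b"abc\xff".decode("ascii", "replace")

print(repr(text), replaced, ignored, repr(decoded))
```

Note that 'replace' is what produces the 'abc?' result the question asks for, while 'ignore' drops the character entirely.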

Marinna answered 29/7, 2010 at 17:14 Comment(0)

You can define your own error handler, register it with codecs.register_error, and pass its name as the errors argument wherever you encode or decode. See this example:

import codecs
from logging import getLogger

log = getLogger()

def custom_character_handler(exception):
    log.error("%s for %s on %s from position %s to %s. Using '?' in-place of it!",
            exception.reason,
            exception.object[exception.start:exception.end],
            exception.encoding,
            exception.start,
            exception.end )
    return ("?", exception.end)

codecs.register_error("custom_character_handler", custom_character_handler)

print( b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'.decode('utf8', 'custom_character_handler') )
print( codecs.encode(u"abc\u03c0de", "ascii", "custom_character_handler") )

Running it, you will see:

invalid start byte for b'\xbb' on utf-8 from position 5 to 6. Using '?' in-place of it!
Føö?Bår
ordinal not in range(128) for π on ascii from position 3 to 4. Using '?' in-place of it!
b'abc?de'
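
To avoid passing the handler on every call, one option is to set it once on the stream that does the encoding or decoding. The following is only a sketch under assumptions: the handler name "question_mark" is hypothetical, and an in-memory io.BytesIO stands in for a real file or socket.

```python
import codecs
import io

def question_mark_handler(exc):
    # Hypothetical handler: substitute '?' for anything that cannot be converted.
    if isinstance(exc, UnicodeEncodeError):
        # One '?' per unencodable character in the failing run.
        return (u"?" * (exc.end - exc.start), exc.end)
    if isinstance(exc, UnicodeDecodeError):
        return (u"?", exc.end)
    raise exc

codecs.register_error("question_mark", question_mark_handler)

# Writing: the error policy is fixed once, when the text stream is created.
out = io.TextIOWrapper(io.BytesIO(), encoding="latin-1", errors="question_mark")
out.write(u"abc\u2013")
out.flush()
written = out.buffer.getvalue()

# Reading: the same policy applies to undecodable bytes.
src = io.TextIOWrapper(io.BytesIO(b"ab\xffc"), encoding="ascii", errors="question_mark")
read_back = src.read()

print(written, read_back)
```

The built-in open() accepts the same encoding and errors arguments, so a real file can be opened with the registered handler name in the same way.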

References:

  1. https://docs.python.org/3/library/codecs.html#codecs.register_error
  2. https://docs.python.org/3/library/exceptions.html#UnicodeError
  3. How to ignore invalid lines in a file?
  4. 'str' object has no attribute 'decode'. Python 3 error?
  5. How to replace invalid unicode characters in a string in Python?
  6. UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?
Make answered 27/4, 2019 at 2:16 Comment(0)
