UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1

16

113

I'm having a few issues trying to encode a string to UTF-8. I've tried numerous things, including using string.encode('utf-8') and unicode(string), but I get the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal not in range(128)

This is my string:

(。・ω・。)ノ

I don't see what's going wrong, any idea?

Edit: The problem is that printing the string as it is does not display it properly. I also get this error when I try to convert it:

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128)
Helianthus answered 12/5, 2012 at 7:39 Comment(2)
It's just a normally inserted string. The same happens when I just try printing it.Helianthus
I hit the same error during pip install, and fixed it by following this: [install some devel][1] [1]: #17932226Oliguria
74

This happens because the encoding of your terminal is not set to UTF-8. Here is my terminal:

$ echo $LANG
en_GB.UTF-8
$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
(。・ω・。)ノ
>>> 

On my terminal the example works with the above setting, but if I get rid of the LANG setting it no longer works:

$ unset LANG
$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128)
>>> 

Consult the docs for your Linux distribution to find out how to make this change permanent.
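To see which encoding Python has picked up from the environment, a small check like the following may help. It is only a sketch: the helper name describe_output_encoding is made up for illustration, and the values it reports depend entirely on your system.

```python
# Sketch: report the settings that decide how print encodes text.
# Written to run on both Python 2.7 and Python 3.
from __future__ import print_function

import locale
import os
import sys


def describe_output_encoding():
    """Collect the encoding-related settings of the current process."""
    return {
        # Encoding used when text is printed; may be None on Python 2
        # when stdout is a pipe or a file rather than a terminal.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding Python derives from the locale settings.
        "preferred": locale.getpreferredencoding(),
        # Environment variables the locale is read from.
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
    }


for name, value in sorted(describe_output_encoding().items()):
    print(name, "=", value)
```

If stdout or preferred shows an ASCII-only codec, printing the decoded string is likely to fail in the way shown above.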

Spud answered 12/5, 2012 at 11:5 Comment(2)
Missing locales could also be a reason. To install them, run sudo apt-get install language-pack-de or sudo locale-gen de_DE.UTF-8 (for German locales).Phytohormone
For me, the missing environment variable is LC_ALL, and the simplest value that would fix it is C.UTF-8Caducity
25

try:

string.decode('utf-8')  # or:
unicode(string, 'utf-8')

edit:

'(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'.decode('utf-8') gives u'(\uff61\uff65\u03c9\uff65\uff61)\uff89', which is correct.

so your problem must be at some other place, possibly where you do something with it that triggers an implicit conversion (could be printing, writing to a stream...)

to say more we'll need to see some code.
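As a rough illustration of the two separate steps (decode the bytes first, then encode for whatever the output accepts), here is a sketch that runs on Python 2.7 and 3. The ASCII target is only an example of an output that cannot represent the characters; the variable names are made up.

```python
# The b'' literal is a byte string on both Python 2.7 and Python 3.
raw = b'(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'

# Step 1: bytes -> text. This is the step that already works in the question.
text = raw.decode('utf-8')

# Step 2: text -> bytes in the encoding of the output.
# A UTF-8 output can represent every character, so this round-trips.
as_utf8 = text.encode('utf-8')

# An ASCII output cannot. 'replace' substitutes '?' for each character
# that does not fit instead of raising UnicodeEncodeError.
as_ascii = text.encode('ascii', 'replace')

print(as_ascii.decode('ascii'))  # (?????)?
```

The implicit conversion done by print is step 2 with the terminal's encoding and strict error handling, which is why it raises instead of printing question marks.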

Provender answered 12/5, 2012 at 7:53 Comment(4)
Both return UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-5: character maps to <undefined>Helianthus
'(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'Helianthus
All I'm trying to do is print the original string in its original format, but I get (´¢í´¢Ñ¤ë´¢Ñ´¢í)´¥ë.Helianthus
the string is utf8-encoded. if you print it, it just writes the bytes to the output stream, and if your terminal doesn't interpret them as utf8 you end up with garbage. with decode you convert it to unicode, then you can encode it again to an encoding your terminal understands.Provender
22

My +1 to mata's comment at https://mcmap.net/q/193661/-unicodedecodeerror-39-ascii-39-codec-can-39-t-decode-byte-0xef-in-position-1 and to Nick Craig-Wood's demonstration. You have decoded the string correctly. The problem is with the print command: it converts the Unicode string to the console encoding, and the console is not capable of displaying the string. Try writing the string to a file and looking at the result in a decent editor that supports Unicode:

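import codecs

# UTF-8 encoded byte string from the question
s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
# Decode the bytes into a Unicode string
s1 = s.decode('utf-8')
# Write the text as UTF-8; the with block closes the file afterwards
with codecs.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(s1)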

Then you will see (。・ω・。)ノ.
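If no Unicode-capable editor is at hand, the file can also be checked from Python by reading it back. This is a sketch under assumptions: it uses io.open (available on Python 2.7 and 3) and a temporary directory instead of out.txt, and the variable names are made up.

```python
import io
import os
import tempfile

raw = b'(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
text = raw.decode('utf-8')

# Hypothetical scratch location, used here so the sketch leaves out.txt alone.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# Write the text; io.open encodes it as UTF-8 on the way out.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read the raw bytes back: they should equal the original byte string.
with io.open(path, 'rb') as f:
    on_disk = f.read()

# Read the text back: it should equal the decoded string.
with io.open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()

print(on_disk == raw, read_back == text)
```

If both comparisons hold, the data is intact and only the console display is at fault.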

Reversible answered 12/5, 2012 at 11:43 Comment(0)
10

If you are working on a remote host, look at /etc/ssh/ssh_config on your local PC.

If this file contains the line:

SendEnv LANG LC_*

comment it out by adding # at the start of the line. It might help.

With this line, ssh sends language related environment variables of your PC to the remote host. It causes a lot of problems.

Gracielagracile answered 28/9, 2014 at 1:18 Comment(1)
Thanks! These solved the problem that I had installing pip packages with ansible and vagrantNudism
10

Try setting the system default encoding to UTF-8 at the start of the script, so that implicit conversions between byte strings and Unicode strings use UTF-8 instead of ASCII (Python 2 only).

# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
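
A possible alternative when only the output needs to be UTF-8: rather than changing the interpreter-wide default, wrap the byte stream in a UTF-8 writer with codecs.getwriter. This is a sketch, not a drop-in replacement for the snippet above. The helper name utf8_writer is made up, and an in-memory buffer stands in for the real stream (typically sys.stdout in a Python 2 script).

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Wrap a byte stream so that text written to it is encoded as UTF-8."""
    return codecs.getwriter('utf-8')(byte_stream)


# Demonstrated on an in-memory buffer so the result can be inspected.
buffer = io.BytesIO()
out = utf8_writer(buffer)
out.write(u'(\uff61\uff65\u03c9\uff65\uff61)\uff89\n')

expected = b'(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89\n'
print(buffer.getvalue() == expected)
```

This keeps the change local to one stream, so implicit conversions elsewhere in the program behave as before.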
Gouda answered 18/11, 2017 at 17:14 Comment(2)
why do we need the reload in this case?Contraction
This does not work in Python 3 as explained here. For me, Tsutomu's answer below did the trick.Gastrolith
5

It's fine to use the code below at the top of your script, as Andrei Krasutski suggested.

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

But I suggest also adding the # -*- coding: utf-8 -*- line at the very top of the script.

In my case, omitting it throws the error below when I try to execute basic.py.

$ python basic.py
  File "01_basic.py", line 14
SyntaxError: Non-ASCII character '\xd9' in file basic.py on line 14, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

The following is the code in basic.py that throws the above error.

code with error

from pylatex import Document, Section, Subsection, Command, Package
from pylatex.utils import italic, NoEscape

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def fill_document(doc):
    with doc.create(Section('ِش سثؤفهخى')):
        doc.append('إخع ساخعمي شمصشغس سحثشن فاث فقعفا')
        doc.append(italic('فشمهؤ ؤخىفثىفس شقث شمسخ ىهؤث'))

        with doc.create(Subsection('آثص ٍعلاسثؤفهخى')):
            doc.append('بشةخعس ؤقشئغ ؤاشقشؤفثقس: $&#{}')


if __name__ == '__main__':
    # Basic document
    doc = Document('basic')
    fill_document(doc)

Then I added the # -*- coding: utf-8 -*- line at the very top and executed the script again. It worked.

code without error

# -*- coding: utf-8 -*-
from pylatex import Document, Section, Subsection, Command, Package
from pylatex.utils import italic, NoEscape

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def fill_document(doc):
    with doc.create(Section('ِش سثؤفهخى')):
        doc.append('إخع ساخعمي شمصشغس سحثشن فاث فقعفا')
        doc.append(italic('فشمهؤ ؤخىفثىفس شقث شمسخ ىهؤث'))

        with doc.create(Subsection('آثص ٍعلاسثؤفهخى')):
            doc.append('بشةخعس ؤقشئغ ؤاشقشؤفثقس: $&#{}')


if __name__ == '__main__':
    # Basic document
    doc = Document('basic')
    fill_document(doc)

Thanks.
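
To check whether a file carries an encoding declaration that Python will recognise, something along these lines could be used. It is a simplified sketch: the pattern is based on the one published in PEP 263, the function name declared_encoding is made up, and it only looks at the first two lines without applying the PEP's finer rules.

```python
import re

# Pattern based on the one given in PEP 263 for the declaration comment.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def declared_encoding(source_lines):
    """Return the encoding declared in the first two lines, or None."""
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


# Both the Emacs-style and the short form are recognised.
print(declared_encoding(['# -*- coding: utf-8 -*-', 'import sys']))  # utf-8
print(declared_encoding(['#coding: utf-8']))                         # utf-8
# No declaration at all.
print(declared_encoding(['import sys', 'reload(sys)']))              # None
```

The declaration must sit on the first or second line, which is why a shebang line followed by the coding comment also works.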

Fleawort answered 21/3, 2018 at 8:14 Comment(2)
Using #coding: utf-8 rather than # -*- coding: utf-8 -*- is easier to remember. It works out of the box; see PEP 263 -- Defining Python Source Code Encodings.Gouda
Thanks for the suggestion. Will try out at my end and update it in the answer.Fleawort
4

No problems with my terminal. The above answers helped me look in the right direction, but it didn't work for me until I added 'ignore':

fix_encoding = lambda s: s.decode('utf8', 'ignore')

As indicated in the comment below, this may lead to undesired results. On the other hand, it may work well enough to get things running if you don't care about losing some characters.
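
To make the trade-off concrete, here is a small sketch of what the error handlers do with a byte that is not valid UTF-8. The sample bytes are made up (plain ASCII with one stray Latin-1 byte), and the results described in the comments are what I would expect on Python 3.

```python
# ASCII text containing one stray Latin-1 byte (\xe9), which is not valid UTF-8.
damaged = b'caf\xe9 ok'

# 'ignore' silently drops the offending byte.
dropped = damaged.decode('utf-8', 'ignore')

# 'replace' keeps a visible marker (U+FFFD) where the byte was.
marked = damaged.decode('utf-8', 'replace')

# The default 'strict' handler raises instead of guessing.
try:
    damaged.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(dropped == u'caf ok', marked == u'caf\ufffd ok', strict_failed)
```

With 'replace' the loss at least stays visible in the output, which can make later debugging easier than with 'ignore'.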

Ries answered 25/12, 2013 at 3:34 Comment(2)
This is wrong, you're forcing your encoding lambda function to ignore the encoding itself which means you're losing characters.Caine
This solved my problem, where I did not know the original encoding and I did not care about losing some characters.Henig
2

This works for Ubuntu 15.10:

sudo locale-gen "en_US.UTF-8"
sudo dpkg-reconfigure locales
Chaworth answered 11/7, 2016 at 13:7 Comment(0)
1

It looks like your string is encoded to utf-8, so what exactly is the problem? Or what are you trying to do here..?

Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89'
>>> s1 = s.decode('utf-8')
>>> print s1
(。・ω・。)ノ
>>> s2 = u'(。・ω・。)ノ'
>>> s2 == s1
True
>>> s2
u'(\uff61\uff65\u03c9\uff65\uff61)\uff89'
Dyal answered 12/5, 2012 at 8:8 Comment(1)
Printing the original string as is gives (´¢í´¢Ñ¤ë´¢Ñ´¢í)´¥ë, I want it to encode properly.Helianthus
1

In my case, it was caused by my Unicode file being saved with a "BOM". To solve this, I cracked open the file using BBEdit and did a "Save as..." choosing for encoding "Unicode (UTF-8)" and not what it came with which was "Unicode (UTF-8, with BOM)"
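
If re-saving the file is not an option, the BOM can also be handled when decoding. A minimal sketch, assuming the data is UTF-8 with an optional BOM; the sample bytes are made up:

```python
import codecs

# UTF-8 data with a byte order mark in front: b'\xef\xbb\xbfhello'
with_bom = codecs.BOM_UTF8 + b'hello'

# Plain 'utf-8' keeps the BOM as the character U+FEFF at the start.
plain = with_bom.decode('utf-8')

# 'utf-8-sig' removes a leading BOM if there is one...
stripped = with_bom.decode('utf-8-sig')

# ...and leaves data without a BOM untouched.
no_bom = b'hello'.decode('utf-8-sig')

print(plain == u'\ufeffhello', stripped == u'hello', no_bom == u'hello')
```

The same codec name can be passed as the encoding when opening a file with codecs.open or io.open.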

Molybdenite answered 20/6, 2017 at 21:5 Comment(0)
0

I was getting the same type of error, and I found that the console is not capable of displaying the string in another language. Hence I made the below code changes to set default_charset as UTF-8.

data_head = [('\x81\xa1\x8fo\x89\xef\x82\xa2\x95\xdb\x8f\xd8\x90\xa7\x93x\x81\xcb3\x8c\x8e\x8cp\x91\xb1\x92\x86(\x81\x86\x81\xde\x81\x85)\x81\xa1\x8f\x89\x89\xf1\x88\xc8\x8aO\x81A\x82\xa8\x8b\xe0\x82\xcc\x90S\x94z\x82\xcd\x88\xea\x90\xd8\x95s\x97v\x81\xa1\x83}\x83b\x83v\x82\xcc\x82\xa8\x8e\x8e\x82\xb5\x95\xdb\x8c\xaf\x82\xc5\x8fo\x89\xef\x82\xa2\x8am\x92\xe8\x81\xa1', 'shift_jis')]
default_charset = 'UTF-8' #can also try 'ascii' or other unicode type
print ''.join([ unicode(lin[0], lin[1] or default_charset) for lin in data_head ])
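
The same idea, decoding each piece with its own charset and falling back to a default, could be written so that it runs on both Python 2.7 and 3. This is only a sketch: the function name decode_parts and the sample data are made up for illustration.

```python
def decode_parts(parts, default_charset='utf-8'):
    """Join (raw_bytes, charset_or_None) pairs into one text string."""
    return u''.join(
        # Use the charset given for the piece, else the default.
        raw.decode(charset or default_charset)
        for raw, charset in parts
    )


# Hypothetical sample: five hiragana characters as Shift_JIS bytes,
# followed by an ASCII piece with no charset given.
parts = [
    (b'\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd', 'shift_jis'),
    (b' ok', None),
]
text = decode_parts(parts)

print(text == u'\u3053\u3093\u306b\u3061\u306f ok')
```

Printing the joined text still depends on the console encoding, so the result is compared here rather than printed directly.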
Aleras answered 10/5, 2016 at 10:16 Comment(0)
0

This is the best answer: https://mcmap.net/q/25720/-setting-the-correct-encoding-when-piping-stdout-in-python

On Linux:

export PYTHONIOENCODING=utf-8

so that sys.stdout.encoding is set to utf-8.
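
The effect of the variable can be checked by starting a child interpreter with it set and asking what it reports. A sketch under assumptions: the helper name child_stdout_encoding is made up, and the child is launched with the same interpreter via sys.executable.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and
    return the stdout encoding it reports."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    output = subprocess.check_output(
        [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
        env=env,
    )
    return output.decode('ascii').strip()


reported = child_stdout_encoding('utf-8')
# Normalise spelling differences such as 'UTF-8' versus 'utf-8'.
print(codecs.lookup(reported).name)
```

Because the child's output goes to a pipe here, this also shows that the variable applies when stdout is not a terminal.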

Slob answered 26/2, 2017 at 14:20 Comment(0)
-1

BOM, it's so often BOM for me

Open the file in vi, run

:set nobomb

and save it. That nearly always fixes it in my case.

Olindaolinde answered 19/4, 2018 at 13:14 Comment(0)
-1

I had the same error with URLs containing non-ASCII characters (bytes with values above 127).

url = url.decode('utf8').encode('utf-8')

This worked for me in Python 2.7. I suppose the assignment changed 'something' in the internal representation of the str: it forces the backing byte sequence in url to be decoded correctly and then puts the result back into a UTF-8 str with everything in the right place. Unicode in Python is black magic to me. Hope this is useful.
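
For what it's worth, when the input is a byte string that already holds valid UTF-8, this round trip should return the same bytes, so it acts mainly as a validity check. A sketch with made-up URLs and a made-up helper name:

```python
def roundtrip_utf8(raw):
    """Decode bytes as UTF-8 and encode the result again."""
    return raw.decode('utf-8').encode('utf-8')


# Hypothetical URL whose path contains an e-acute as UTF-8 bytes.
url = b'http://example.com/caf\xc3\xa9'
unchanged = roundtrip_utf8(url) == url

# The same character as a single Latin-1 byte is not valid UTF-8,
# so the decode step raises instead of passing the data through.
try:
    roundtrip_utf8(b'http://example.com/caf\xe9')
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(unchanged, rejected)
```

If the round trip makes an error disappear, the original value was possibly not a plain byte string at that point, which may be worth checking with type(url).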

Guanaco answered 20/7, 2018 at 20:57 Comment(0)
-2

I solved this problem by changing the database engine in settings.py to 'ENGINE': 'django.db.backends.mysql'; don't use 'ENGINE': 'mysql.connector.django'.

Alissaalistair answered 29/6, 2014 at 4:27 Comment(2)
@rayryeng Could you explain the reason for your edit? It appears to completely change the meaning of what the OP wrote, from recommending a particular setting to recommending against it.Bracteate
@AndrewMedico - My apologies. I saw that this post was very similar to another one so I believed that they were the same. I will revert back.Desirous
-2

Just convert the text explicitly to string using str(). Worked for me.

Bikaner answered 5/2, 2016 at 8:22 Comment(0)
