'ascii' codec can't encode character : ordinal not in range (128)

4

7

I'm scraping some webpages using selenium and beautifulsoup. I'm iterating through a bunch of links, grabbing info, and then dumping it into a JSON:

for event in events:

    case = {'Artist': item['Artist'], 'Date': item['Date'], 'Time': item['Time'], 'Venue': item['Venue'],
        'Address': item['Address'], 'Coordinates': item['Coordinates']}
    item[event] = case

with open("testScrape.json", "w") as writeJSON:
    json.dump(item, writeJSON, ensure_ascii=False)

When I get to this link: https://www.bandsintown.com/e/100778334-jean-deaux-music-at-rickshaw-stop?came_from=257&utm_medium=web&utm_source=home&utm_campaign=event

The code breaks and I get the following error:

 Traceback (most recent call last):
  File "/Users/s/PycharmProjects/hi/BandsintownWebScraper.py", line 126, in <module>
    json.dump(item, writeJSON, ensure_ascii=False)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 7: ordinal not in range(128)

I've tried to use:

json.dump(item, writeJSON, ensure_ascii=False).decode('utf-8')

And:

json.dump(item, writeJSON, ensure_ascii=False).encode('utf-8')

With no success. I believe it is the ï character on the link that is causing this to fail. Can anyone give a brief run-down of what's happening, what encode/decode means, and how to fix this issue?

Wiltz answered 12/5, 2019 at 23:49 Comment(3)
This error may mean that the data is not in UTF-8 but in another encoding, e.g. Latin1 or CP1250. - Decerebrate
Python keeps text as Unicode in memory, where a character can take up to 4 bytes. To send it or save it to a file, it is converted (encoded) to UTF-8, Latin1, etc. In UTF-8, for example, a character takes between 1 and 4 bytes, which usually needs less space than a fixed 4 bytes for every character. When you read the data back, you have to convert (decode) it to Unicode again so Python can work with it. - Decerebrate
You didn't show the full error (traceback), but I think the problem is not in this part of the code but rather in the code that gets the data from the web page. Maybe you have to decode the data from the page. Or you may have to set the encoding when opening the file, e.g. open(..., encoding='utf-8') (in Python 2 that keyword is only available through io.open). - Decerebrate
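
To make the encode/decode distinction from the comments above concrete, here is a minimal sketch. It is written for Python 3 (the question uses Python 2.7, where the text type is called unicode rather than str), and the sample string is an assumption chosen only because it contains the u'\xe6' character from the traceback.

```python
# Text (str in Python 3) is a sequence of code points; bytes are what
# actually gets written to a file or sent over the network.
text = "J\u00e6an"  # hypothetical sample containing U+00E6

# Encoding turns text into bytes using a specific codec.
utf8_bytes = text.encode("utf-8")      # U+00E6 becomes the two bytes C3 A6
latin1_bytes = text.encode("latin-1")  # U+00E6 becomes the single byte E6

# Decoding turns bytes back into text; it has to use the matching codec.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text

# ASCII only covers code points 0-127, so encoding U+00E6 with it fails.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

print(ascii_error)
```

The failure in the last step is the same kind of error as in the traceback: something tried to turn text into bytes with the ASCII codec.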
21

You might need to set PYTHONIOENCODING before running your Python script in the shell. This variable controls the encoding Python uses for stdin, stdout, and stderr. For example, I got the same error while redirecting the script's output into a log file:

$ your_python_script > output.log
'ascii' codec can't encode characters in position xxxxx-xxxxx: ordinal not in range(128)

After setting PYTHONIOENCODING to utf8 in the shell, the script ran without the ASCII codec error:

$ export PYTHONIOENCODING=utf8

$ your_python_script > output.log
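
The effect of the variable can be reproduced from Python itself. The following is only a sketch, assuming Python 3.5+ for subprocess.run; the helper name run_with_io_encoding is made up for this illustration. It starts a child interpreter that prints a single non-ASCII character, once with PYTHONIOENCODING=ascii and once with utf8. Note that the variable only affects the standard streams, not files opened with open().

```python
import os
import subprocess
import sys


def run_with_io_encoding(encoding):
    """Run a child Python that prints U+00E6 with PYTHONIOENCODING set."""
    env = dict(os.environ, PYTHONIOENCODING=encoding)
    return subprocess.run(
        [sys.executable, "-c", "print(u'\\xe6')"],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )


ascii_run = run_with_io_encoding("ascii")  # child fails with UnicodeEncodeError
utf8_run = run_with_io_encoding("utf8")    # child writes the UTF-8 bytes

print(ascii_run.returncode)
print(utf8_run.stdout)
```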
Hermon answered 11/8, 2019 at 8:46 Comment(3)
Thanks so much! I was going crazy trying to figure out why my python3 wasn't printing basic unicode! - Schwaben
Thank you!!! I struggled all day with this bug until your solution solved it! - Lulita
Thank you, this resolved my issue. I tried sanitizing the data within the code and used normalize and decode methods. I should've checked the runtime environment before blaming the code. Thank you again for pointing this out. - Stairs
7

Your problem is that, in Python 2, a file object (as returned by open()) writes str objects (byte strings), not unicode objects. Passing ensure_ascii=False to json.dump() makes it attempt to write Unicode strings to the file directly as unicode objects. The file then tries to convert them implicitly with the ASCII codec, which fails as soon as a non-ASCII character appears.

json.dump(item, writeJSON, ensure_ascii=False).encode('utf-8')

This attempted fix doesn't work because json.dump() doesn't return anything; instead, it writes content directly to the file. (If there weren't any Unicode text in item, this would crash after json.dump() completed -- json.dump() returns None, which can't have .encode() called on it.)
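
The difference between json.dump() and json.dumps() can be checked with a short sketch. It is written for Python 3, uses an in-memory io.StringIO in place of a real file, and the item dictionary is hypothetical sample data.

```python
import io
import json

item = {"Artist": "J\u00e6an"}  # hypothetical data containing U+00E6

# json.dump() writes into the file-like object and returns None,
# so there is nothing to call .encode() or .decode() on.
buffer = io.StringIO()
result = json.dump(item, buffer, ensure_ascii=False)

# json.dumps() returns the serialized text, which can then be encoded.
text = json.dumps(item, ensure_ascii=False)
encoded = text.encode("utf-8")

print(result)
print(encoded)
```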

There are three ways to go about fixing this:

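  1. Use Python 3. The unification of str and unicode in Python 3 makes your existing code work as-is, provided the default file encoding on your system can represent the characters; passing encoding="utf-8" to open() removes that dependency.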

  2. Remove ensure_ascii=False from your call to json.dump. Non-ASCII characters will be written to the file in escaped form -- for instance, ï will be written as \u00ef. This is a perfectly valid way of representing Unicode characters, and most JSON libraries will handle it just fine.

  3. Wrap the file object in a UTF-8 StreamWriter:

    import codecs
    with codecs.getwriter("utf8")(open("testScrape.json", "w")) as writeJSON:
        json.dump(item, writeJSON, ensure_ascii=False)
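
As a rough illustration of options 1 and 2, the sketch below writes the file with an explicit UTF-8 encoding and also shows the escaped form that the default ensure_ascii=True produces. It is a Python 3 sketch: the item dictionary and the temporary directory are assumptions for the example, and Python 2 behaviour is not covered here, since its json module can emit byte-string chunks that io.open may reject.

```python
import io
import json
import os
import tempfile

item = {"Artist": "J\u00e6an"}  # hypothetical data containing U+00E6

# Write into a temporary directory so the example leaves no stray files behind.
path = os.path.join(tempfile.mkdtemp(), "testScrape.json")

# Option 1 with an explicit encoding: the file object takes text and
# encodes it as UTF-8 on the way to disk.
with io.open(path, "w", encoding="utf-8") as write_json:
    json.dump(item, write_json, ensure_ascii=False)

# Inspect the raw bytes that ended up in the file.
with io.open(path, "rb") as raw:
    raw_bytes = raw.read()

# Option 2: with the default ensure_ascii=True the output is pure ASCII,
# and non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# Reading the file back with the same encoding restores the original data.
with io.open(path, encoding="utf-8") as read_json:
    loaded = json.load(read_json)

print(raw_bytes)
print(escaped)
```

Both forms decode to the same data, so the choice mostly comes down to whether the file should be readable by humans.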
    
Sclerenchyma answered 13/5, 2019 at 1:18 Comment(2)
A way to do sort of the same thing as fix 1 and 3 at the same time would be to use io.open instead of Python 2's builtin open function. io.open is the same as Python 3's open, so it supports Unicode by default. - Gibbosity
@Blckknght, you should make that an answer, it certainly helped me :) - Maricruzmaridel
2

I got the same issue when running a pipeline in GitLab (I never saw this error in GitHub Actions, CircleCI, or other pipelines).

I finally fixed it this way:

before_script:
    - apt-get clean && apt-get update && apt-get install -y locales
    - echo "en_US UTF-8" > /etc/locale.gen
    - locale-gen en_US.UTF-8
    - export LANG=en_US.UTF-8
    - export LANGUAGE=en_US:en
    - export LC_ALL=en_US.UTF-8
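
To check whether locale settings like these actually reach the interpreter, a small diagnostic can be run inside the job. This is only a sketch; the function name describe_encodings is made up for this example, and the values it reports depend entirely on the environment it runs in.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derived from the current environment."""
    return {
        "preferred": locale.getpreferredencoding(False),
        # stdout may have been replaced by an object without this attribute
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in sorted(describe_encodings().items()):
    print("{}: {}".format(name, value))
```

If any of these report an ASCII-based encoding, the ASCII codec error is likely to come from the environment rather than from the script.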
Hallucinogen answered 25/5, 2022 at 14:33 Comment(0)
0

pip install unidecode

from unidecode import unidecode

for col in ['column1', 'column2']:
    df[col] = df[col].apply(unidecode)

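If df is a pandas DataFrame, just put the names of the affected columns in the list. Note that unidecode replaces non-ASCII characters with ASCII approximations, so the original characters are lost. I had the same issue today and this is how I got around it.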
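
To show why stripping to ASCII is lossy, here is a sketch that uses only the standard library. It is not unidecode: it swaps in unicodedata normalization, which drops characters that have no decomposition, whereas unidecode generally maps them to an ASCII approximation instead. The function name strip_to_ascii and the sample strings are made up for the illustration.

```python
import unicodedata


def strip_to_ascii(text):
    """Decompose accented characters, then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# U+00EF decomposes into "i" plus a combining diaeresis, so the "i" survives.
print(strip_to_ascii("na\u00efve"))  # prints "naive"

# U+00E6 has no decomposition, so it is removed entirely.
print(strip_to_ascii("J\u00e6an"))   # prints "Jan"
```

Either way the data itself is changed, so this works around the encoding error rather than fixing it.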

Lenette answered 7/12, 2021 at 22:6 Comment(0)
