UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
C

5

48

I am trying to read twitter data from json file using python 2.7.12.

Code I used is such:

    import json
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    
    def get_tweets_from_file(file_name):
        tweets = []
        with open(file_name, 'rw') as twitter_file:
            for line in twitter_file:
                if line != '\r\n':
                    line = line.encode('ascii', 'ignore')
                    tweet = json.loads(line)
                    if u'info' not in tweet.keys():
                        tweets.append(tweet)
        return tweets

Result I got:

    Traceback (most recent call last):
      File "twitter_project.py", line 100, in <module>
        main()                  
      File "twitter_project.py", line 95, in main
        tweets = get_tweets_from_dir(src_dir, dest_dir)
      File "twitter_project.py", line 59, in get_tweets_from_dir
        new_tweets = get_tweets_from_file(file_name)
      File "twitter_project.py", line 71, in get_tweets_from_file
        line = line.encode('ascii', 'ignore')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

I went through all the answers from similar issues and came up with this code and it worked last time. I have no clue why it isn't working now.

Crosscurrent answered 22/7, 2016 at 4:11 Comment(0)
D
28

It doesn't help that you have sys.setdefaultencoding('utf-8'), which confuses things further. It's a nasty hack and you need to remove it from your code. See https://mcmap.net/q/37316/-why-should-we-not-use-sys-setdefaultencoding-quot-utf-8-quot-in-a-py-script for more information.

The error is happening because line is a byte string (str) and you're calling encode() on it. encode() only makes sense on a Unicode string, so Python first tries to convert line to Unicode using the default encoding, which in your case is UTF-8 but should be ASCII. Either way, 0x80 is neither valid ASCII nor a valid UTF-8 start byte, so that implicit decode fails. That is why an encode() call raises a UnicodeDecodeError.

0x80 is valid in some character sets. In windows-1252/cp1252 it's €.
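To see the failure in isolation, here is a minimal sketch that decodes the same bytes with three codecs. The sample bytes and the helper name try_decode are made up for illustration; they are not from the question's data.

```python
# Hypothetical sample: three ASCII letters followed by the byte 0x80.
raw = b"caf\x80"


def try_decode(data, encoding):
    """Decode data with the given codec, or report why decoding failed."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason


# 0x80 cannot start a UTF-8 sequence and is outside the ASCII range,
# but cp1252 maps it to the euro sign.
for codec in ("utf-8", "ascii", "cp1252"):
    print(codec, "->", try_decode(raw, codec))
```

Only the cp1252 decode succeeds here, which is the same situation the traceback in the question describes.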

The trick here is to understand the encoding of your data all the way through your code. At the moment, you're leaving too much up to chance. Unicode strings are a handy Python feature: you decode encoded byte strings once on input and can then forget about the encoding until you need to write or transmit the data.

Use the io module to open the file in text mode and decode the file as it goes - no more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here I've set the encoding to windows-1252.

    import io
    import json

    with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
        for line in twitter_file:
            # line is now a <type 'unicode'>
            tweet = json.loads(line)

The io module also provides universal newlines. This means \r\n is detected as a newline, so you don't have to watch for it.
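Putting those pieces together, a fuller version of the reader might look like the sketch below. The function name read_json_lines and its default encoding are assumptions for illustration, and the right encoding depends on how the files were actually written. It reports lines that are not valid JSON instead of stopping, which helps when debugging a "No JSON object could be decoded" error.

```python
import io
import json


def read_json_lines(path, encoding="utf-8"):
    """Read one JSON document per line, decoding the file while reading it."""
    records = []
    with io.open(path, "r", encoding=encoding) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                # Blank lines (including a bare "\r\n") carry no data.
                continue
            try:
                records.append(json.loads(line))
            except ValueError as exc:
                # json raises ValueError (or a subclass) on malformed input.
                print("line %d is not valid JSON: %s" % (line_number, exc))
    return records
```

Calling read_json_lines(file_name, encoding='windows-1252') would then return a list of dictionaries, one per well-formed line.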

Dagger answered 22/7, 2016 at 20:46 Comment(9)
Thank you!!! I just tried but it is not working - I am trying to replace 'windows-1252' as I am using mac. I tried 'latin-1' etc. Or does it not matter...? Thanks for the detail explanation...Crosscurrent
Code runs but I get "NULL" for all variables in the database. When I opened each json file and checked, there are tweets in the file. Also, when I asked to print number of tweets, it says I have 0 tweets....Crosscurrent
You're going to need to debug your code and find out what's not working. It sounds like my solution has worked but you have missed something. What happens if you print tweet in your for loop?Dagger
Ok. I am trying to debug code...It gives me blank list i.e. [ ]. Thank you so much!!Crosscurrent
I am little confused..I guess you mean when I print line right after tweet=json.loads(line)? In this case, nothing prints out..Crosscurrent
The example I gave had an error in it - the file mode should have been r. It will throw an error with rw. What code have you got?Dagger
Hi:) I tried rmode and it gave me error as such: ValueError: No JSON object could be decoded. So I have a+ instead in my code. Would this be a problem...? I thought issue is coming from somewhere else..Crosscurrent
a+ means open the file for updating and append to the end always. Change it back to r, then print line and then work out why the JSON can't be decoded.Dagger
Thanks!! I'll start from there again!Crosscurrent
S
147

In my case (macOS), there was a .DS_Store file in my data folder. It is a hidden, auto-generated file, and it caused the issue. I was able to fix the problem after removing it.
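Since the OS may recreate the file, filtering it out in code can be more robust than deleting it. The following is a minimal sketch of that idea; the helper name list_data_files and the default ".json" extension are assumptions, not part of the original code.

```python
import os


def list_data_files(folder, extension=".json"):
    """Return paths of regular files in folder that have the given extension."""
    paths = []
    for name in sorted(os.listdir(folder)):
        if name.startswith("."):
            # Skips hidden files such as .DS_Store.
            continue
        if not name.endswith(extension):
            continue
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            paths.append(path)
    return paths
```

Looping over list_data_files(src_dir) instead of os.listdir(src_dir) keeps binary helper files away from the JSON parser.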

Stole answered 17/10, 2017 at 9:29 Comment(4)
Thanks Sung! was the same issue for me on Mac OS, I've ignored the file in code eventually, since it will get re-generated by the OS and I wasn't sure how safe it will be to permanently remove it...Inkerman
yes! thank you! i simply added an if statement to read only .txt files and it worksTrusting
Same here, I was using os.listdir() which contained only .csv files and this .DS_Store was creating problems. Easily solved with try and except.Springfield
In this case you can use find . -name '.DS_Store' -type f -delete to delete all such files recursively in the current directory. See jonbellah.com/articles/recursively-remove-ds-storeMoorer
C
11

For others who come across this question due to the error message, I ran into this error trying to open a pickle file when I opened the file in text mode instead of binary mode.

This was the original code:

    import pickle as pkl
    with open(pkl_path, 'r') as f:
        obj = pkl.load(f)

And this fixed the error:

    import pickle as pkl
    with open(pkl_path, 'rb') as f:
        obj = pkl.load(f)

Catfish answered 16/3, 2022 at 15:18 Comment(0)
C
0

I got a similar error by accidentally trying to read a Parquet file as a CSV:

    pd.read_csv('file.parquet')      # wrong reader for a binary Parquet file

Using the matching reader instead:

    pd.read_parquet('file.parquet')
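If the file extension cannot be trusted, one option is to check the first bytes before choosing a reader. Parquet files begin with the four magic bytes PAR1. The helper name looks_like_parquet below is hypothetical, and this is only a rough sketch, not a full format detector.

```python
def looks_like_parquet(path):
    """Return True if the file starts with the Parquet magic bytes b'PAR1'."""
    with open(path, "rb") as handle:
        return handle.read(4) == b"PAR1"
```

A caller could then pick pd.read_parquet when this returns True and pd.read_csv otherwise.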

Cammycamomile answered 4/11, 2022 at 22:16 Comment(0)
T
-1

The error occurs when you try to read a tweet containing text like

"@Mike http:\www.google.com \A8&^)((&() how are&^%()( you ", which cannot be read as a string; you are supposed to read it as a raw string instead. But converting it to a raw string still gives an error, so I suggest you

read the JSON file like this:

    import codecs
    import json

    keys = []      # collected tweet ids
    fulldata = []  # collected tweet texts
    with codecs.open('tweetfile', 'rU', 'utf-8') as f:
        for line in f:
            data = json.loads(line)
            print data["tweet"]
            keys.append(data["id"])
            fulldata.append(data["tweet"])

This loads the data from the JSON file.

You can also write it to a CSV using pandas.

    import pandas as pd
    output = pd.DataFrame(data={"tweet": fulldata, "id": keys})
    output.to_csv("tweets.csv", index=False, quoting=1)  # 1 == csv.QUOTE_ALL

Then read from the CSV to avoid the encoding and decoding problem.

Hope this helps you solve your problem.

Midhun

Tigre answered 22/7, 2016 at 7:2 Comment(6)
What are you talking about with this "cannot be read as a string" and "must be converted to raw string". There is no such thing as a raw string in Python. There are raw string literals, but you can't do any runtime conversions to those for what I hope are obvious reasons.Phagy
Hey, when I got the same error while reading a JSON file I was able to overcome it with the above code, which is why I suggested it. If I am wrong, you are always welcome to correct me.Tigre
@MidhunMohan, Thank you!! I referred to your code as well.Crosscurrent
No....not yet..I changed your code as such, as I was little confused what you meant by "tweet" and "id". with codecs.open(file_name,'rU','utf-8') as twitter_file:/ for line in twitter_file:/ tweet = json.loads(line)/ print line/ if u'info' not in tweet.keys():/ tweets.append(tweet)/Crosscurrent
According error is ValueError: No JSON object could be decoded. I am trying to debug my code as there might be issue in other part of codes...Crosscurrent
according to the basics jsons is loaded as dictionary so data=json.loads(line) print data["tweet"] /this will give you the value of the key tweetTigre
