Error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

https://github.com/affinelayer/pix2pix-tensorflow/tree/master/tools

An error occurred when running "process.py" from the site above.

python tools/process.py --input_dir data --operation resize --output_dir data2/resize
data/0.jpg -> data2/resize/0.png

Traceback (most recent call last):
  File "tools/process.py", line 235, in <module>
    main()
  File "tools/process.py", line 167, in main
    src = load(src_path)
  File "tools/process.py", line 113, in load
    contents = open(path).read()
  File"/home/user/anaconda3/envs/tensorflow_2/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

What is the cause of the error? Python's version is 3.5.2.

Mikesell answered 20/2, 2017 at 8:43 Comment(0)

Python is trying to convert a byte array (a bytes object that it assumes to be a utf-8-encoded string) to a unicode string (str). This process is, of course, decoding according to the utf-8 rules. When it tries this, it encounters a byte sequence that is not allowed in utf-8-encoded strings (namely this 0xff at position 0).
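
A quick way to reproduce this with exactly the byte from your traceback:

>>> b'\xff'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte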

Since you did not provide any code we could look at, we only could guess on the rest.

From the stack trace we can assume that the triggering action was reading from a file (contents = open(path).read()). I propose recoding this like so:

with open(path, 'rb') as f:
    contents = f.read()  # contents is bytes; no decoding takes place

The b in the mode specifier of open() states that the file shall be treated as binary, so contents will remain a bytes object. No decoding attempt will happen this way.
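
If you do need a str later, you can decode the bytes yourself once you know the file's real encoding. A minimal sketch (the 'utf-16' here is only an assumed example, not something the traceback tells us):

text = contents.decode('utf-16')  # substitute the file's actual encoding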

Oblate answered 20/2, 2017 at 9:26 Comment(8)
I'm getting the error "ValueError: mode string must begin with one of 'r', 'w', 'a' or 'U', not 'br'"Iterative
@Iterative Ok, then use rb (I thought order was of no importance, but it seems to be, at least in some systems/versions). I changed my answer accordingly.Oblate
byte 0xff in position 0 could also mean the file is encoded in UTF-16; in that case you can do with open(path, encoding='utf-16') as f: insteadGrettagreuze
What if there is actually no 0xff character at position 0? And it is UTF-8 encoded.Buzzard
A pure '\xFF' character will be encoded in UTF-8 as '\xC3\xBF'. UTF-8 encodes every character with a set MSB using two or more bytes. (See the output of printf "\xff" | iconv -f latin1 -t utf-8 | xxd in a shell.) A verbatim '\xFF' at the beginning of a UTF-8-encoded string is an encoding error (it could be called a syntax error in terms of UTF-8).Oblate
@NikolaiRKristiansen it's not. b'\xff'.decode('utf16') => UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0xff in position 0: truncated dataMeldameldoh
I think this is the correct answer: when dealing with binary files the encoding is not involved, and should not be, at all.Romany
@Meldameldoh The \xff at the beginning is not part of the file contents but part of a marker (the BOM) which tells us the encoding of the rest of the file. So read binary (without decoding): b = open(..., 'rb').read() and then check for the marker and decode: if b[:2] == b'\xff\xfe': return b.decode('utf-16'). The marker for utf-16 is two bytes, \xff\xfe or \xfe\xff, not just one byte.Oblate

Use this solution and it will strip out (ignore) the offending characters and return the string without them. Only use this if your need is to strip them, not to convert them.

with open(path, encoding="utf8", errors='ignore') as f:
    contents = f.read()  # undecodable bytes are silently dropped

Using errors='ignore' you'll just lose some characters. But if you don't care about them, as they seem to be extra characters originating from the bad formatting and programming of the clients connecting to my socket server, then it's an easy, direct solution. reference
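
As a quick sketch of what the error handlers do (the byte string is a made-up example):

data = b'\xffhello'                   # hypothetical bytes with one bad leading byte
data.decode('utf-8', 'ignore')        # 'hello' -- the bad byte is silently dropped
data.decode('utf-8', 'replace')       # '\ufffdhello' -- the bad byte becomes U+FFFD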

Glynisglynn answered 1/2, 2018 at 5:53 Comment(7)
Works for decode() as well: contents = contents.decode('utf-8', 'ignore') Source: docs.python.org/3/howto/unicode.html#the-string-typeSeavey
When you say "lose some characters" do you mean that the file with errors won't be read? or that not all the content of that file will be read?Jacy
@Jacy As it is ignoring the errors, the bytes causing the issues won't be decoded; but I haven't ever come across any content being skipped while reading. So basically the encoding issues are just ignored.Glynisglynn
@NitishKumarPal, ok so no real content should be skipped or lostJacy
I don't understand: it did the work, but I didn't skip any CSV row. Not sure what happened here.Germanium
Resolved the issue of converting HEX into CHAR representation using this encoding. Thanks a lotFattal
Thanks, I kept getting errors reading SQL files with utf-8 on some of the files. When I added errors='ignore', I was able to continue processing the SQL files.Nimmons

Use the ISO-8859-1 encoding to solve the issue.
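
A minimal sketch (the filename is a placeholder):

with open('your_file.txt', encoding='ISO-8859-1') as f:
    contents = f.read()

Since ISO-8859-1 assigns a character to every possible byte value, this can never raise a UnicodeDecodeError; as the comments below point out, that does not guarantee the result is the correct text.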

Tetrahedron answered 4/6, 2019 at 20:4 Comment(6)
This will hide the error but produce garbage if the actual encoding is not ISO-8859-1. If you are not sure, examine some of the strings with character codes in the range 128-255. Maybe see also tripleee.github.io/8bitBelton
This will eliminate errors, but only because ISO-8859-1 defines a character for each one of the 256 possible byte values. They won't necessarily be the right characters and you need to verify that you're reading the correct text despite the lack of errors.Litha
Sometimes it will give correct characters, but most of the time it will give garbage values like "1.5 1 0 obj > endobj 2 0 obj > endobj 4 0 obj > stream x½½þù/qyúßù§ÿ¢ÿèÿþðçõ¯ÿø¿þòÿG\ü;x¯¯oüùïó_÷þýòÿøß~ù¿ùå/þ¡îÝR^?/jáòòòüþô~ÿ|þx}L_¿}^__.÷ÛóçûÓëççóíöôöúòüÒWÿú¿x¿0´ÍIâ èÛå)ä¼{$éúÎ oÎçåùóZØil¬Pÿá$0JÏ{²úñsr^nSquilgee
No, it only has 256, but it means every byte corresponds to one character, and thus you will not ever get any errors. If this wasn't the correct encoding, you get mojibake in your output instead.Belton
@YashrajNigam You are overgeneralizing. It sounds like most of the time you are processing data which uses a completely different encoding, but that is untrue for many visitors here. The real solution, as several of the comments and answers here try to explain, is to establish the correct character encoding in each case.Belton
This worked for me. I'm probably doing this the wrong way, but in Google Colab I was saving to my Google Drive as a doc with the extension .docx, even though I was specifying "UTF-8" when writing the file. I guess I still had characters that were not UTF-8, because reading the file in the same format would generate errors. But when reading with the encoding above, the file rendered perfectly.Soneson

I had an issue similar to this and ended up using UTF-16 to decode. My code is below.

with open(path_to_file, 'rb') as f:
    contents = f.read()
contents = contents.decode("utf-16").rstrip("\n")  # decode first; bytes.rstrip("\n") would raise a TypeError
contents = contents.split("\r\n")

This reads the file contents in as bytes, decodes them as UTF-16, and then splits the decoded text into lines.

Kaminski answered 16/8, 2017 at 15:34 Comment(3)
In Python 3 you can simplify this by using the encoding param with open(path, encoding='utf-16') as fGrettagreuze
@NikolaiRKristiansen I tried using your method, but got an error as TypeError: an integer is required (got type str). Why? Both files are binary and read as rb.Keloid
@Keloid The encoding param only makes sense when reading text. Drop the 'b' from the mode argument and try again. Read more in the docs: docs.python.org/3/library/functions.html#openGrettagreuze

I came across this thread when suffering from the same error. After doing some research, I can confirm this is an error that happens when you try to decode a UTF-16 file with UTF-8.

With UTF-16, the first character (2 bytes in UTF-16) is a Byte Order Mark (BOM), which is used as a decoding hint and doesn't appear as a character in the decoded string. This means the first byte will be either FE or FF, and the second will be the other.
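
A minimal sketch of checking for the BOM before decoding (path is assumed to point at your file):

import codecs

with open(path, 'rb') as f:
    raw = f.read()

if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
    text = raw.decode('utf-16')  # the utf-16 codec consumes the BOM itself
else:
    text = raw.decode('utf-8')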

Heavily edited after I found out the real answer

Thyroxine answered 4/12, 2017 at 13:1 Comment(2)
This ended 2 hours of headache! Opening the file with open('filename', 'r') as f: and then printing its contents shows UTF-8, which is wrong.Comedian
This one worked for my case. with open(filename, encoding='utf-16') as f:Watkins

Those getting similar errors while handling data frames with Pandas can use the following solution.

Example solution:

df = pd.read_csv("File path", encoding='cp1252')
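
If you are not sure the file really is cp1252, here is a sketch using the third-party chardet package (mentioned in the comments below) to guess the encoding first; the filename is a placeholder and the guess should still be verified:

import chardet
import pandas as pd

with open("file.csv", "rb") as f:
    guess = chardet.detect(f.read())  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}

df = pd.read_csv("file.csv", encoding=guess["encoding"])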
Provencher answered 23/8, 2021 at 8:6 Comment(3)
This is where I ended up, without knowing this answer. Just checked in this thread whether someone answered like this, and yes - someone did.Observer
Randomly guessing at a different encoding might remove the error, but could then produce garbage results. The useful answer would be a way to figure out which encoding is actually correct.Belton
You might be right. With that said, you can use chardet to auto-detect the file's encoding. Maybe I should also state that this solution directly applies to reading CSV files created from Microsoft Excel.Provencher

I had a similar issue with PNG files. I tried the solutions above without success; this one worked for me in Python 3.8:

with open(path, "rb") as f:
    contents = f.read()  # PNG data is binary; read it as bytes and don't decode it as text
Inflammatory answered 24/11, 2020 at 11:22 Comment(0)

Use only

base64.b64decode(a) 

instead of

base64.b64decode(a).decode('utf-8')
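
The reason, as a sketch (the payload here is a hypothetical example): b64decode() returns bytes, and those bytes are not necessarily valid UTF-8, e.g. when the payload is binary data such as an image:

import base64

a = base64.b64encode(b'\xff\xd8\xff')  # hypothetical payload: JPEG magic bytes
raw = base64.b64decode(a)              # bytes -- fine
raw.decode('utf-8')                    # raises UnicodeDecodeError: invalid start byte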
Thingumabob answered 17/6, 2018 at 13:34 Comment(3)
It's working, but just to understand: can you explain why, please? :)Specialty
especially, where do you use it? What is 'a'?Pacify
removing decode solved an issue I was having too.Ardeth

It simply means that one chose the wrong encoding to read the file.

On Mac, use file -I file.txt to find the correct encoding. On Linux, use file -i file.txt.

Silkstocking answered 10/11, 2019 at 19:32 Comment(1)
file is very imprecise when it comes to guessing encodings. You can try the Python libraries chardet or ftfy but they too are heuristic tools, not infallible oracles. If you know or are able to guess what text to expect, you can look up problem bytes at tripleee.github.io/8bit and figure out at least a good guess for your data.Belton

This is due to using a different encoding when reading the file. In Python, text files are decoded with a default encoding (typically utf-8), which may not match the file's actual encoding on various platforms.

I propose an encoding that can help you solve this if 'utf-8' does not work.

import csv

with open(path, newline='', encoding='cp1252') as csvfile:
    reader = csv.reader(csvfile)

It should work if you change the encoding here. You can also find other encodings in the standard-encodings list if the above doesn't work for you.
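
A minimal sketch of testing candidate encodings (the list is only an example; latin1 comes last because it never raises, it just produces wrong characters when it is not the true encoding):

for enc in ('utf-8', 'utf-16', 'cp1252', 'latin1'):
    try:
        with open(path, newline='', encoding=enc) as csvfile:
            csvfile.read()
        print('decoded cleanly as', enc)
        break
    except UnicodeDecodeError:
        continue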

Shallot answered 23/9, 2020 at 23:9 Comment(0)

I had this UnicodeDecodeError while trying to read a '.csv' file using pandas.read_csv(). In my case, I could not manage to overcome this issue using other encodings. But instead of using

pd.read_csv(filename, delimiter=';')

I used:

pd.read_csv(open(filename, 'r'), delimiter=';')

which just seems to work fine for me.

Note: in the open() function, use 'r' instead of 'rb'. 'rb' returns a bytes object, which is what causes this decoder error in the first place; that is the same problem as in read_csv(). But 'r' returns str, which is needed since our data is in a .csv, and using the default encoding='utf-8' parameter we can easily parse the data with the read_csv() function.

Aretino answered 28/12, 2021 at 8:45 Comment(0)

If you are receiving data from a serial port, make sure you are using the right baud rate (and the other configuration settings): decoding as utf-8 with the wrong serial configuration will generate the same error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

To check your serial port configuration on Linux, use: stty -F /dev/ttyUSBX -a
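
A sketch of reading and decoding defensively, assuming the third-party pyserial package; the port name and baud rate are placeholders that must match your device, otherwise the bytes arrive garbled and decoding fails:

import serial

ser = serial.Serial('/dev/ttyUSB0', baudrate=115200, timeout=1)
raw = ser.readline()                           # raw bytes from the device
text = raw.decode('utf-8', errors='replace')   # 'replace' avoids crashing on bad bytes
ser.close()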

Broadspectrum answered 19/4, 2019 at 10:36 Comment(0)

I had a similar issue and searched all over the internet for this problem.

If you have this problem, just copy your HTML code into a new HTML file and use the normal <meta charset="UTF-8">, and it will work.

Just create a new HTML file in the same location and use a different name.

Euchre answered 30/7, 2020 at 23:55 Comment(0)

Check the path of the file to be read. My code kept giving me errors until I changed the path name to the present working directory. The error was:

newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Lactoscope answered 4/7, 2017 at 14:19 Comment(0)

If you are on a Mac, check for a hidden file, .DS_Store. After removing the file, my program worked.

Phylys answered 21/1, 2019 at 19:49 Comment(1)
Are you sure you answered the correct question? What has a .DS_Store file to do with the Python UnicodeDecodeError?Osteopath

I had a similar problem.

Solved it by:

import io

with io.open(filename, 'r', encoding='utf-8') as fn:  # in Python 3, io.open is an alias of the built-in open
    lines = fn.readlines()

However, I had another problem. Some HTML files (in my case) were not utf-8, so I received a similar error. When I excluded those HTML files, everything worked smoothly.

So, apart from fixing the code, also check the files you are reading from; maybe there really is an incompatibility there.

Karyolysis answered 1/11, 2019 at 10:50 Comment(0)

You have to read this file with the latin1 encoding, as there are some special characters in it; a minimal sketch is below.
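
A minimal sketch (the filename is a placeholder):

with open('your_file.txt', encoding='latin1') as f:
    contents = f.read()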

The problem here is the encoding type. When Python can't decode the data being read, it raises an error.

You can use latin1 or other encoding values.

I say try and test to find the right one for your dataset.

Frit answered 15/8, 2020 at 8:7 Comment(0)

I had the same issue when processing a file generated on Linux. It turned out it was related to files containing question marks.

Basically answered 19/5, 2020 at 7:48 Comment(1)
Could you explain why Linux or question marks have anything to do with this question? The question mark can be encoded/decoded using ASCII, UTF-8, Latin-1 and many more encodings without problems, so I don't see how it could cause a UnicodeDecodeError.Osteopath

The following code worked in my case:

df = pd.read_csv(filename, sep='\t', encoding='cp1252')

Dentalium answered 18/6, 2022 at 23:10 Comment(0)

If possible, open the file in a text editor and try to change the encoding to UTF-8. Otherwise, do it programmatically at the OS level.

Blanketyblank answered 13/8, 2017 at 13:48 Comment(0)
