Python: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Asked Answered
K

4

3

I am fetching data from a catalog and it's giving data in bytes format.

Bytes data:

b'\x80\x00\x00\x00\n\x00\x00%\x83\xa0\x08\x01\x00\xbb@\x00\x00\x05p 
\x02\x00>\xf3\x00\x00\x00}\x02\x00`\x03\xef0\x00\x00\r\xc0 
\x06\xf0>\xf3\x00\x00\x02\x88\x02\x03\xec\x03\xef0\x00\x00/.....'

While converting this data to a string (or any readable format) I'm getting this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Code I used (Python 3.7.3):

blobs = blob.decode('utf-8')

AND

import json
json.dumps(blob.decode())

I've also used pickle, ast and pprint but they are not helpful here.


Krysta answered 3/6, 2020 at 10:27 Comment(5)
This is not readable data, so you can't decode it as utf-8.Fault
@Fault so is there any way to make this readable?Krysta
@Fault this may be readable data, but it's definitely not utf-8.Allerie
is there any way to make this readable? - What you show in the question is the in my opinion best way to make it readable.Allerie
You say from a catalog -- please add what catalog you are/were using.Allerie
A
4

The UTF-8 encoding has some built-in redundancy that serves at least two purposes:

1) locating code point boundaries when reading forwards or backwards

Start bytes (shown in binary; the dots carry the actual payload bits) match one of these four patterns:

0.......
110.....
1110....
11110...

whereas continuation bytes (0 to 3 of them, depending on the start byte) always have this form:

10......

2) checking for validity

If these rules are not respected, it is safe to say that the data is not UTF-8, e.g. because it was corrupted during a transfer.
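As a quick sketch (the function name and labels are mine, not part of any standard API), the byte patterns above can be checked by inspecting the top bits of each byte:

```python
def classify(byte: int) -> str:
    """Classify a single byte value according to the UTF-8 bit patterns."""
    if byte < 0x80:        # 0.......  ASCII, a complete 1-byte sequence
        return "start (1-byte)"
    if byte < 0xC0:        # 10......  may only continue a sequence
        return "continuation"
    if byte < 0xE0:        # 110.....
        return "start (2-byte)"
    if byte < 0xF0:        # 1110....
        return "start (3-byte)"
    if byte < 0xF8:        # 11110...
        return "start (4-byte)"
    return "never valid in UTF-8"

# Classify the first few bytes from the question:
for b in b'\x80\x00\x00%\x83':
    print(hex(b), classify(b))
```

The very first byte, 0x80, is classified as a continuation byte, so no valid UTF-8 stream can begin with it.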

Conclusion

Why is it possible to say that b'\x80...' cannot be UTF-8? The very first byte already violates the encoding: 0x80 matches the pattern 10......, so it can only be a continuation byte, yet it appears at the start. This is exactly what your error message says:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

And even if you skip this byte, you run into another problem a few bytes later at b'%\x83', so most likely you are either trying to decode the wrong data or assuming the wrong encoding.
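You can reproduce the exact error with just a prefix of the data from the question:

```python
blob = b'\x80\x00\x00\x00\n\x00\x00%\x83'

try:
    blob.decode('utf-8')
except UnicodeDecodeError as exc:
    # 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
    print(exc)
```

The exception even records where decoding failed (`exc.start` is 0 here), which is useful when hunting for the first offending byte in larger blobs.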

Allerie answered 3/6, 2020 at 17:43 Comment(3)
If you downvoted, please explain what's wrong with this answer. Thanks!Allerie
Not my downvote, but merely demonstrating that UTF-8 is not the correct encoding for this data is a halfway answer at best. It seems obvious to me that the input is not text at all, so discussing the features of one particular text encoding seems completely tangential.Witchcraft
@Witchcraft Thanks for the hint. I'm maybe a bit too obsessed with UTF-8 here. My idea was to show that we can be sure the error message is correct.Allerie
L
3

You can try ignoring the bytes that are not valid UTF-8:

blob.decode('utf-8', 'ignore')

It's not a great solution, but the way you're generating the byte object seems to have some issues. Maybe UTF-8 is not the proper encoding for your data.
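To illustrate (using a prefix of the bytes from the question) how much data the 'ignore' error handler silently drops:

```python
blob = b'\x80\x00\x00\x00\n\x00\x00%\x83\xa0\x08\x01\x00\xbb@'

# Every byte that is invalid at its position is discarded without warning.
text = blob.decode('utf-8', 'ignore')
print(repr(text))
print(len(blob), 'bytes ->', len(text), 'characters')
```

Here 4 of the 15 bytes (0x80, 0x83, 0xA0, 0xBB) vanish, so whatever structure the binary data had is destroyed, which is why this is rarely the right fix.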

Lobelia answered 3/6, 2020 at 10:32 Comment(2)
There are ` characters in your bytes, which is a problem for string literals in Python. Those may have to be replaced with \`.Lobelia
This is wrong on so many levels. Effectively extracting just the ASCII bytes from data which is predominantly not ASCII is hardly ever going to be useful.Witchcraft
W
2

The data in your example is clearly not text in any common encoding. Neither Python nor we can figure out a way to turn data that is obviously not text into a string.

If this is a well-defined binary file format, find a parser for it (ideally a popular Python library, though for more obscure or proprietary formats you may not be able to find one), or write one yourself if you can figure out how the data is structured, either by clever experimentation and good guesswork, or by finding documentation (if not authoritative, then perhaps more or less speculative third-party documentation).

If you simply want to turn the bytes into a string of code points with the same Unicode code points (so that for example the input byte \xff maps to the Unicode code point U+00FF), the 'latin-1' encoding does this, obscurely but conveniently. The result in this case will obviously not be useful human-readable text; in many ways, it would then be more natural and quite possibly less error-prone and more convenient to just keep the data as bytes instead.
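For example, 'latin-1' maps every byte value 0 through 255 to the Unicode code point with the same number, so decoding can never fail and the round trip is lossless:

```python
data = b'\x80\x00\x00\x00\n\x00\x00%\x83'

# 'latin-1' cannot raise: every possible byte value is a valid code point.
text = data.decode('latin-1')
print(repr(text))

assert ord(text[0]) == 0x80             # byte 0x80 -> U+0080
assert text.encode('latin-1') == data   # lossless round trip
```

This is occasionally handy for shoving binary data through an API that insists on strings, but as noted above, keeping the data as bytes is usually the cleaner choice.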

Witchcraft answered 6/7, 2021 at 12:12 Comment(1)
Maybe a hexdump could also help. But I think the question what catalog was used will be the key.Allerie
E
0

For this encoding error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

or others like it, you just have to open the database file with the .json extension, change its encoding to UTF-8 (for example, in VS Code you can change it in the bottom-right status bar), and save the file...

Now run

 $ git status

you'll get a result something like this:

 On branch master
 Changes not staged for commit:
   (use "git add <file>..." to update what will be committed)
   (use "git restore <file>..." to discard changes in working directory)
        modified:   store/dumps/store.json
   (use "git add <file>..." to include in what will be committed)
        .gitignore

 no changes added to commit (use "git add" and/or "git commit -a")

or something like this one

On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   store/dumps/store.json
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        .gitignore

For the first case, you just have to run

$ git add store/dumps/

The second case doesn't need this previous step...

Now, in both cases, commit the changes with

$ git commit -m "launching to production"

The console will print a message informing you about the additions and changes...

You then have to deploy the app again with

$ git push heroku master

(for heroku users)

after the build, you just have to load the database again with

heroku run python manage.py loaddata store/dumps/store.json

It will load the objects.

Apologies for my English level!

Exemplification answered 9/10, 2020 at 19:15 Comment(1)
This seems quite misdirected. A JSON file should by definition contain UTF-8 already. I guess you assume the file contains a different encoding, and probably that the OP is a Windows user.Witchcraft

© 2022 - 2024 — McMap. All rights reserved.