Dealing with UTF-8 numbers in Python
Suppose I am reading a file containing three comma-separated numbers. The file was saved with an unknown encoding; so far I am dealing with ANSI and UTF-8. If the file is UTF-8 and has one row with the values 115,113,12, then:

with open(file) as f:
    a, b, c = map(int, f.readline().split(','))

would throw this:

invalid literal for int() with base 10: '\xef\xbb\xbf115'

The first number is always prefixed with these '\xef\xbb\xbf' bytes; for the other two numbers the conversion works fine. If I manually replace '\xef\xbb\xbf' with '' before the int conversion, it works.

Is there a better way of doing this for any type of encoded file?
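The manual workaround described above can be sketched as follows in modern Python 3 (the file name is just for the demo; the question's original code was Python 2): read the line in binary, strip a leading UTF-8 BOM if present, then parse.

```python
# UTF-8 byte order mark, as it appears at the start of a BOM-prefixed file
BOM = b'\xef\xbb\xbf'

def read_three_ints(path):
    with open(path, 'rb') as f:          # binary mode: get raw bytes
        raw = f.readline()
    if raw.startswith(BOM):              # file was saved as UTF-8 with BOM
        raw = raw[len(BOM):]
    return tuple(map(int, raw.decode('utf-8').split(',')))

# demo: create a BOM-prefixed file like the one in the question
with open('nums.txt', 'wb') as f:
    f.write(BOM + b'115,113,12')

print(read_three_ints('nums.txt'))  # prints (115, 113, 12)
```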

Brei answered 1/3, 2010 at 23:16 Comment(0)
import codecs

with codecs.open(file, "r", "utf-8-sig") as f:
    a, b, c= map(int, f.readline().split(","))

This works in Python 2.6.4. The codecs.open call opens the file and returns the data as unicode, decoded from UTF-8, with the initial BOM stripped if present.
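In Python 3, codecs.open is no longer needed for this: the built-in open accepts an encoding argument, and the same "utf-8-sig" codec strips the BOM. A minimal sketch (the file name is hypothetical):

```python
# Write a file with a UTF-8 BOM; the "utf-8-sig" codec emits the BOM on write
with open('nums.txt', 'w', encoding='utf-8-sig') as f:
    f.write('115,113,12\n')

# Read it back; "utf-8-sig" silently consumes the BOM if it is there,
# and behaves exactly like plain "utf-8" if it is not
with open('nums.txt', encoding='utf-8-sig') as f:
    a, b, c = map(int, f.readline().split(','))

print(a, b, c)  # prints 115 113 12
```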

Orthopedics answered 2/3, 2010 at 0:1 Comment(4)
Thanks. This works on my UTF-8 files but fails on the Unicode and Unicode big endian (i.e. UTF-16) files. Is there a foolproof way of opening any kind of encoded file and getting those numbers, or do I have to specify the encoding explicitly? – Newsmagazine
AFAIK you have to specify the encoding. Obviously, you can write a small function that tests for the three BOMs and returns an appropriately decoded file. – Orthopedics
Great. I found the chardet module, which does exactly this: chardet.feedparser.org – Newsmagazine
Minor error in your code above: a, b, c = map(int, f.readline().split(",")) – Newsmagazine
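The small BOM-sniffing function suggested in the comments might look like this in Python 3 (a sketch, not the commenters' actual code; the codecs BOM constants are part of the standard library, and the file name is just for the demo):

```python
import codecs

# BOM-to-codec table. If UTF-32 entries were added, they would have to be
# checked first, because the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_BE, 'utf-16'),   # Notepad's "Unicode big endian"
    (codecs.BOM_UTF16_LE, 'utf-16'),   # Notepad's "Unicode"
]

def detect_encoding(path, default='utf-8'):
    """Guess the encoding from the file's BOM, falling back to a default."""
    with open(path, 'rb') as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return default

# demo: a UTF-16 file (Python's 'utf-16' codec writes a BOM automatically)
with open('nums16.txt', 'wb') as f:
    f.write('115,113,12'.encode('utf-16'))

enc = detect_encoding('nums16.txt')
with open('nums16.txt', encoding=enc) as f:
    a, b, c = map(int, f.readline().split(','))
```

The 'utf-16' codec consumes the BOM on decode and uses it to pick the byte order, so both UTF-16 variants map to the same codec name.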

What you're seeing is a UTF-8-encoded BOM, or "Byte Order Mark". The BOM is not usually used for UTF-8 files, so the best way to handle it might be to open the file with a UTF-8 codec and skip over the U+FEFF character if present.
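The skip-if-present approach described here can be sketched like this (Python 3; the file name is hypothetical): decode with a plain UTF-8 codec, under which the BOM comes through as a leading U+FEFF character, then drop it before parsing.

```python
import codecs

# demo file: UTF-8 with a BOM, as produced by e.g. Notepad
with open('bom.txt', 'wb') as f:
    f.write(b'\xef\xbb\xbf115,113,12')

with codecs.open('bom.txt', 'r', 'utf-8') as f:
    line = f.readline()
    if line.startswith('\ufeff'):   # the BOM decodes to U+FEFF; drop it
        line = line[1:]
    a, b, c = map(int, line.split(','))
```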

Membranophone answered 1/3, 2010 at 23:22 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.