Reading UTF-8 with BOM using Python CSV module causes unwanted extra characters [duplicate]

I am trying to read a CSV file with Python with the following code:

import csv

with open("example.txt") as f:
    c = csv.reader(f)
    for row in c:
        print row

My example.txt has only the following content:

Hello world!

For UTF-8 or ANSI encoded files, this gives me the expected output:

> ["Hello world!"]

But if I save the file as UTF-8 with BOM I get this output:

> ["\xef\xbb\xbfHello world!"]

Since I do not have any control over what files the user will use as input, I would like this to work with BOM as well. How can I fix this problem? Is there anything I need to do to ensure that this works for other encodings as well?

Weaner answered 18/11, 2015 at 16:34 Comment(4)
NB: whatever solution you use, the important thing is to use utf-8-sig for decoding. - Krystinakrystle
import csv, csvkit, codecs, unicodecsv

with open("example.txt", 'r') as f:
    c = csv.reader(f)
    for row in c:
        print [unicode(s, "utf-8") for s in row]

with open("example.txt", 'r') as f:
    c = unicodecsv.reader(f)
    for row in c:
        print row

with open("example.txt", 'r') as f:
    c = csvkit.reader(f)
    for row in c:
        print row

All of these print [u'\ufeffHello world!'], so I think it is not a duplicate. The first attempt uses the approach from #17245915. - Alejoa
@ekhumoro: The duplicate is borderline... The other question is about UTF-8 data, while this one is specifically about a BOM in a UTF-8 file. The other page only mentions a BOM (in a single answer) for UTF-16 files. Your comment does answer this question, but IMHO it would deserve to be an answer on a non-duplicate question :-) - Buckbuckaroo
@SergeBallesta. Please read the question more carefully (esp. the last paragraph) - it's not only about the utf-8 signature. Also, the highest voted answer in the dup specifically uses utf-8-sig; but some of the other answers don't - which is why I added a comment here. - Krystinakrystle

You could make use of the unicodecsv Python module as follows:

import unicodecsv

with open('input.csv', 'rb') as f_input:
    csv_reader = unicodecsv.reader(f_input, encoding='utf-8-sig')
    print list(csv_reader)

So for an input file containing the following in UTF-8 with BOM:

c1,c2,c3,c4,c5,c6,c7,c8
1,2,3,4,5,6,7,8

It would display the following:

[[u'c1', u'c2', u'c3', u'c4', u'c5', u'c6', u'c7', u'c8'], [u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8']]
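
For comparison, here is a minimal sketch of the same idea using only the standard library. It assumes Python 3 (the code above is Python 2), where the built-in open() accepts an encoding argument. read_csv_rows is a hypothetical helper name, not part of any library.

```python
import csv


def read_csv_rows(path, encoding="utf-8-sig"):
    """Return all rows of a CSV file as lists of str.

    "utf-8-sig" drops a leading UTF-8 BOM if one is present and otherwise
    behaves like plain "utf-8". (Hypothetical helper, Python 3 assumed.)
    """
    # newline="" is what the csv module documentation recommends for
    # file objects handed to csv.reader.
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))
```

With this, a file saved with or without a UTF-8 BOM should yield the same rows. Passing encoding="utf-8" instead leaves the BOM in the first field as '\ufeff'.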

The unicodecsv module can be installed using pip as follows:

pip install unicodecsv
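
As for other encodings: utf-8-sig only covers UTF-8. One possible approach is to look at the first bytes of the file and choose a codec accordingly. The sketch below assumes Python 3 and assumes that input files are either BOM-marked UTF-8/16/32 or plain UTF-8. guess_encoding_from_bom and its default parameter are hypothetical names introduced for illustration.

```python
import codecs

# Checked in this order on purpose: the UTF-32 little-endian BOM starts with
# the same two bytes as the UTF-16 little-endian BOM.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def guess_encoding_from_bom(path, default="utf-8"):
    """Hypothetical helper: pick a codec name from the file's leading BOM.

    Returns `default` (an assumption, not a detection) when no BOM is found.
    """
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM checked here is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

The returned name can be passed to open(path, encoding=..., newline="") before handing the file to csv.reader. The "utf-16" and "utf-32" codecs use the BOM to determine byte order and strip it. Files without a BOM in a legacy code page (what the question calls ANSI) cannot be recognised this way, so the encoding for those has to come from somewhere else.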
Entitle answered 18/11, 2015 at 16:48 Comment(2)
But what about \ufeff? Isn't it useless? - Alejoa
Indeed, I had put the wrong encoding in; as stated, utf-8-sig should be used. - Entitle
