Reading UTF-8 with BOM using Python CSV module causes unwanted extra characters [duplicate]

Asked 18/11, 2015 at 16:34 Answered 18/11, 2015 at 16:48

python python-2.7 csv character-encoding byte-order-mark

I am trying to read a CSV file with Python with the following code:

import csv

with open("example.txt") as f:
    c = csv.reader(f)
    # Each row comes back as a list of byte strings in Python 2
    for row in c:
        print row

My example.txt has only the following content:

Hello world!

For UTF-8 or ANSI encoded files, this gives me the expected output:

> ["Hello world!"]

But if I save the file as UTF-8 with BOM I get this output:

> ["\xef\xbb\xbfHello world!"]

Since I do not have any control over what files the user will use as input, I would like this to work with BOM as well. How can I fix this problem? Is there anything I need to do to ensure that this works for other encodings as well?

Weaner answered 18/11, 2015 at 16:34 Comment(4)

NB: whatever solution you use, the important thing is to use utf-8-sig for decoding. – Krystinakrystle 18/11, 2015 at 16:44

import csv, csvkit, codecs, unicodecsv

with open("example.txt", 'r') as f:
    c = csv.reader(f)
    for row in c:
        print [unicode(s, "utf-8") for s in row]

with open("example.txt", 'r') as f:
    c = unicodecsv.reader(f)
    for row in c:
        print row

with open("example.txt", 'r') as f:
    c = csvkit.reader(f)
    for row in c:
        print row

All three print [u'\ufeffHello world!'], so I think it is not a duplicate. The first attempt uses #17245915 – Alejoa 18/11, 2015 at 17:4

@ekhumoro: The duplicate is borderline... The other question is about UTF-8 data, while this one is specifically about the BOM in a UTF-8 file. The other page only mentions the BOM (in just one answer) for UTF-16 files. Your comment does answer this question, but IMHO it would deserve to be an answer on a non-duplicate question :-) – Buckbuckaroo 18/11, 2015 at 17:6

@SergeBallesta. Please read the question more carefully (esp. the last paragraph) - it's not only about the utf-8 signature. Also, the highest voted answer in the dup specifically uses utf-8-sig; but some of the other answers don't - which is why I added a comment here. – Krystinakrystle 18/11, 2015 at 17:20
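
On the "other encodings" part of the question, one possible approach is to look at the first few bytes of the file and pick a codec from the BOM, if there is one. The following is only a sketch: `guess_encoding_from_bom` and the `_BOMS` table are hypothetical names, and the UTF-8 fallback is an assumption, since a file without a BOM cannot be identified this way.

```python
import codecs

# Order matters: the UTF-32 LE BOM starts with the same bytes as the
# UTF-16 LE BOM, so the UTF-32 entries must be checked first.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(head, default="utf-8"):
    """Return a codec name based on the BOM at the start of `head` (bytes).

    The returned codecs ("utf-8-sig", "utf-16", "utf-32") consume the BOM
    while decoding. `default` is an assumed fallback for files with no BOM.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

In use, one would read the first four bytes of the file in binary mode, pass them to the helper, and then decode the file with the returned codec name.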

You could make use of the unicodecsv Python module as follows:

import unicodecsv

with open('input.csv', 'rb') as f_input:
    csv_reader = unicodecsv.reader(f_input, encoding='utf-8-sig')
    print list(csv_reader)

So for an input file containing the following in UTF-8 with BOM:

c1,c2,c3,c4,c5,c6,c7,c8
1,2,3,4,5,6,7,8

It would display the following:

[[u'c1', u'c2', u'c3', u'c4', u'c5', u'c6', u'c7', u'c8'], [u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8']]

The unicodecsv module can be installed using pip as follows:

pip install unicodecsv
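
For readers on Python 3, a third-party module may not be needed, because the built-in open() accepts an encoding argument. The following is a minimal sketch assuming Python 3 (the question itself targets Python 2.7); `read_csv_rows` is a hypothetical helper name, and the temporary file only stands in for user-supplied input.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encoding="utf-8-sig"):
    """Return all rows of a CSV file, skipping a leading UTF-8 BOM if present."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))


# Demo: write a small UTF-8 file that starts with a BOM, then read it back.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfc1,c2\r\n1,2\r\n")
try:
    rows = read_csv_rows(tmp.name)
finally:
    os.remove(tmp.name)

print(rows)
```

Because utf-8-sig also decodes UTF-8 files that have no BOM, the same call should work for both kinds of input.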

Entitle answered 18/11, 2015 at 16:48 Comment(2)

But what about \ufeff? Isn't it useless? – Alejoa 18/11, 2015 at 17:12

Indeed, I'd put the wrong encoding in, as stated utf-8-sig should be used. – Entitle 18/11, 2015 at 17:25
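
The \ufeff mentioned in the comments above is what the BOM bytes turn into when decoded as plain utf-8. A short sketch of the difference between the two codecs, using only the standard library (the variable names are illustrative):

```python
import codecs

raw = codecs.BOM_UTF8 + b"Hello world!"   # b"\xef\xbb\xbfHello world!"

# Plain utf-8 keeps the BOM as the character U+FEFF at the start.
with_bom = raw.decode("utf-8")

# utf-8-sig drops a leading BOM if there is one...
without_bom = raw.decode("utf-8-sig")

# ...and decodes BOM-less input exactly like utf-8.
plain = b"Hello world!".decode("utf-8-sig")
```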
