Why is R reading UTF-8 header as text?
I saved an Excel table as text (*.txt). Unfortunately, Excel doesn't let me choose the encoding, so I need to open the file in Notepad (which opens it as ANSI) and save it as UTF-8. Then, when I read it in R:

data <- read.csv("my_file.txt",header=TRUE,sep="\t",encoding="UTF-8")

it shows the name of the first column beginning with "X.U.FEFF.". I know these are the bytes of the byte order mark, which tells programs that the file is in UTF-8 format, so they shouldn't appear as text! Is this a bug, or am I missing some option? Thanks in advance!

Chairmanship answered 12/11, 2013 at 18:9 Comment(7)
try it with the read.csv argument check.names=FALSE. Note that if you use this, you will not be able to directly reference columns with the $ notation.Lunna
UTF-8 files are not supposed to contain a byte order mark, see RFC 3629 for explanation.Ferrule
Thanks @Matthew. It works partially. The X.U.FEFF is gone, but I can't refer to the first column by name anymore (the others still work, though). I still think this is a bug to be solved in future versions of R.Chairmanship
You can refer to them by name if you put them in quotes, e.g., yourdf$"first col"Lunna
@Zack, I've seen some UTF-8 files with these first bytes, so I thought it was a rule. Not a big problem, as I can always rename the first column, just think it should be solved someday.Chairmanship
@Matthew, this second trick didn't work here.Chairmanship
I found a solution at #24568556Overstride
So I was going to give you instructions on how to manually open the file and check for and discard the BOM, but then I noticed this (in ?file):

As from R 3.0.0 the encoding "UTF-8-BOM" is accepted and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications).

which means that if you have a sufficiently new R interpreter,

read.csv("my_file.txt", fileEncoding="UTF-8-BOM", ...other args...)

should do what you want.
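A self-contained sketch of the fix (the file contents and the names COLECAO/VALOR are placeholder data, written out first so the example runs on its own):

```r
# Write a small tab-separated file that starts with the UTF-8 BOM
# (bytes EF BB BF), as Windows Notepad does when saving as UTF-8.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, "wb")
writeBin(charToRaw("\xef\xbb\xbfCOLECAO\tVALOR\nfoo\t1\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" (R >= 3.0.0) strips the mark before
# parsing, so the first column name comes through clean.
good <- read.csv(tmp, header = TRUE, sep = "\t",
                 fileEncoding = "UTF-8-BOM")
names(good)[1]  # "COLECAO"
```

Note the argument is fileEncoding (the encoding of the file itself), not encoding (which only marks how already-read strings are labelled); mixing the two up is a common source of the leftover "X.U.FEFF." prefix.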

Ferrule answered 12/11, 2013 at 18:58 Comment(7)
hmmmm almost there. Now the "X.U.FEFF." became "ï.."Chairmanship
That looks like the file isn't actually UTF-8. Is there any way you can show us a hex dump of the first line of the file? (On most Unix systems, head -1 my_file.txt | hexdump -C will get you a nice hex dump, but I have no idea about a Windows equivalent.)Ferrule
In DOS Prompt, debug does this. The first three bytes are EF BB BF. (I saved the file in Notepad 5.1 build 2600, Windows XP SP3, and it says the format is UTF-8). The rest of the line is the ASCII for the column names.Chairmanship
I need to see the dump for the entire line (or at least the entire first field, i.e. up to and including the first 09), not just the first three bytes.Ferrule
EF BB BF 43 4F 4C 45 43 41 4F 09Chairmanship
Huh. After stripping the BOM, the first field is all ASCII uppercase letters, which should go into a data frame colname just fine. Do you in fact have R 3.x? This is starting to look like a bug in the interpreter.Ferrule
Yes, I have R 3.0.1. I downloaded Notepad++, and it gives me the option to save with and without the BOM. It seems R just can't handle the BOM.Chairmanship
Most of the arguments in read.csv are dummy args -- including fileEncoding.

use read.table instead

 read.table("my_file.txt", header=TRUE, sep="\t", fileEncoding="UTF-8")
Haldan answered 12/11, 2013 at 18:17 Comment(3)
With read.table I get an error: "Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 9191 did not have 25 elements". My read command is actually more complicated, it is: data <- read.table("my_file.txt",header=TRUE,sep="\t",stringsAsFactors=FALSE,strip.white=TRUE,encoding="UTF-8",quote="")Chairmanship
great!! Then it worked. Now you just need to clean up your source file ;) Open it up in a plain text editor (I like sublime text 3), get down to line 9191 and inspect itHaldan
Thanks, @Ricardo. I only needed the comment.char="". But now it behaves exactly the same as read.csv... :(Chairmanship
I had the same issue loading a csv file using read.csv (with encoding="UTF-8-BOM"), read.table, or read_csv from the readr package. None of these attempts proved successful.

I definitely could not work with the BOM tag, because upon subsetting my data (using either subset() or df[df$var=="value",]) the first row was not taken into account.

I finally found a workaround that made the BOM tag vanish. Using the read.csv function, I just defined a string vector for my column names in the argument col.names = ... . This works like a charm and I can subset my data without issues.

I use R version 3.5.0.
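A sketch of that workaround, using a hypothetical two-column file (id, value): since col.names overrides whatever read.csv finds in the header line, the BOM-carrying header row is consumed and discarded, and the mark never reaches the data frame.

```r
# Hypothetical sample: a comma-separated file prefixed with the UTF-8 BOM
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(charToRaw("\xef\xbb\xbfid,value\n1,10\n2,20\n"), con)
close(con)

# col.names replaces the names read from the header line,
# so the BOM ends up only in the discarded header text
df <- read.csv(tmp, header = TRUE, col.names = c("id", "value"))
df[df$id == 1, ]  # subsetting on the first column now works
```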

Proclus answered 6/8, 2018 at 10:34 Comment(0)
Possible solution from the comments:

Try it with the read.csv argument check.names=FALSE. Note that if you use this, you will not be able to directly reference columns with the $ notation, unless you surround the name in quotes. For instance: yourdf$"first col".

Lunna answered 12/11, 2013 at 18:46 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.