R's read.csv prepending 1st column name with junk text [duplicate]
I have exported data from a result grid in SQL Server Management Studio to a csv file. The csv file looks correct.

But when I read the data into an R dataframe using read.csv, the first column name is prepended with "ï..". How do I get rid of this junk text?

Example:

str(trainData)

'data.frame':   64169 obs. of  20 variables:    
 $ ï..Column1             : int  3232...   
 $ Column2                : int  4242...

The data looks something like this (nothing special) :

Column1,Column2
100116577,100116577
100116698,100116702

Armet answered 4/7, 2014 at 6:33 Comment(4)
the .. usually come from spaces being replaced by .'s. Is the i a part of the csv? I have only ever seen X being added to colnames when they start with a number.Lotson
Can you show a sample of the input data and the read.table command you used to read it?Bewray
You can also just replace it afterwards using regex. names(trainData)[1] <- gsub("[^A-Za-z0-9]", "", names(trainData)[1])Jostle
I just had this error and solved it by copying the dataset into a new .csv file - There were no spaces before the column names and I could not find another way to get rid of this symbolHuth

You've got a Unicode UTF-8 BOM at the start of the file:

http://en.wikipedia.org/wiki/Byte_order_mark

A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.

R is giving you the ï and then converting the other two into dots as they are non-alphanumeric characters.

Here:

https://stat.ethz.ch/pipermail/r-help/2014-February/370760.html

Duncan Murdoch suggests:

You can declare a file to be in encoding "UTF-8-BOM" if you want to ignore a BOM on input

So try your read.csv with fileEncoding="UTF-8-BOM" or persuade your SQL wotsit to not output a BOM.
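As a minimal sketch of that suggestion: the snippet below writes a small throwaway CSV that starts with the three BOM bytes (EF BB BF) and reads it back with fileEncoding="UTF-8-BOM". The temp file and its contents are made up for illustration; only the read.csv call matters for your real file.

```r
# Build a tiny CSV that begins with the UTF-8 BOM (bytes EF BB BF).
# The path and the data are placeholders for illustration only.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("Column1,Column2\n100116577,100116577\n100116698,100116702\n"), con)
close(con)

# fileEncoding (not encoding) is the argument that makes R skip the BOM.
trainData <- read.csv(path, fileEncoding = "UTF-8-BOM")
names(trainData)
```

If the first name still comes back with the prefix, check that the argument is spelled fileEncoding and not encoding.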

Otherwise you may as well test if the first name starts with ï.. and strip it with substr (as long as you know you'll never have a column that does start like that genuinely...)
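A rough sketch of that fallback, assuming the junk always shows up as exactly "ï.." on the first name. The helper name strip_bom_prefix is hypothetical, not part of base R.

```r
# Hypothetical helper: drop a leading "ï.." from the first column name only.
# "\u00ef" is the ï character, written as an escape to avoid source-encoding issues.
strip_bom_prefix <- function(nms) {
  junk <- "\u00ef.."
  if (length(nms) > 0 && isTRUE(startsWith(nms[1], junk))) {
    # Keep everything after the three junk characters.
    nms[1] <- substr(nms[1], nchar(junk) + 1, nchar(nms[1]))
  }
  nms
}

# Example with names as they appear in the str() output above.
cleaned <- strip_bom_prefix(c("\u00ef..Column1", "Column2"))
cleaned
```

On a real data frame this would be used as names(trainData) <- strip_bom_prefix(names(trainData)). Names that don't start with the prefix are returned unchanged.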

Shuster answered 4/7, 2014 at 7:7 Comment(3)
Tried read.csv("data.csv",encoding="UTF-8-BOM") but still getting the BOM. When saving results to file from sql server man studio, the default encoding is UTF-8. Changed the encoding to ANSI and it removed the BOM.Armet
If I create a file with a BOM I can't replicate your behaviour, so maybe it's an operating system or a windows version thing. Using ANSI (or ASCII?) encoding will just make problems if you have any non-plain-english characters in your output... Could you post a sample file?Shuster
Important edit: the correct arg is fileEncoding= not encoding=, which is silently ignored by read.csv.Shuster
