read.csv not working as expected in R

I am stumped. Normally read.csv works as expected, but I have come across a file where the behavior is not what I expect. It is most likely user error on my part, but any help will be appreciated.

Here is the URL for the file:

http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip

Here is my code to download the file, unzip it, and read it in:

 URL <- "http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip"
 download.file(URL, destfile="temp.zip")
 unzip("temp.zip")
 tmp <- read.table("sfa0910.csv", 
                   header=T, stringsAsFactors=F, sep=",", row.names=NULL)

Here is my problem. When I open the CSV in Excel, the data look as expected. When I read the data into R, the first column is actually named row.names. R is reading in one extra column of data, but I can't figure out where the "error" occurs that causes row.names to become a column. Simply put, it looks like the data have shifted over.

However, what is strange is that the last column in R does appear to contain the proper data.

Here are a few rows from the first few columns:

tmp[1:5,1:7]
  row.names UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP
1    100654      R     4496       R     1044       R       23
2    100663      R    10646       R     1496       R       14
3    100690      R      380       R        5       R        1
4    100706      R     6119       R      774       R       13
5    100724      R     4638       R     1209       R       26

Any thoughts on what I could be doing wrong?

Amphistylar answered 15/8, 2012 at 23:29 Comment(7)
Don't ever assume Excel correctly represents the contents of your CSV file. Open the CSV in a text editor instead (not that this is the cause of your problem, but as a general rule).Mausoleum
remove the row.names = NULL argument.Crossgrained
@ttmaccer - that's strange, you don't need to authenticate with the site. I just tried it from the web and it auto-downloaded the file to my computer. I am using Chrome.Amphistylar
@Crossgrained - I tried this as well, but it didn't work. What is strange is that the last column of data in R appears to be OK. It's not like I can just shift the column names.Amphistylar
@JoshuaUlrich - I know, I have seen examples of where Excel parses incorrectly. That said, the data do appear to be OK. Other versions of the data (past years) work fine; there is something strange with this particular file that I can't see by scanning it in a text editor.Amphistylar
@ttmaccer - Numbers like 100654 should be the values in the column UNITID. The values shown above under UNITID actually belong to the column XSCUGRAD. Thanks again.Amphistylar
I think this points to the problem, but I can't find the offending line in the csv: dim(read.csv("sfa0910.csv", header = F, skip = 1)) is 6852 452, whereas length(unlist(strsplit(readLines('sfa0910.csv', 1), ','))) is 451.Crossgrained

I may have a fix, based on mnel's comments:

dat <- readLines(paste("sfa", '0910', ".csv", sep = ""))
## count the commas on each line (each match has length 1, so the sum is the count)
ncommas <- sapply(gregexpr(',', dat), function(x) sum(attributes(x)$match.length))
> head(ncommas)
[1] 450 451 451 451 451 451

All lines after the first have an extra separator, which Excel ignores.

## strip the final comma from every line except the header
for (i in seq_along(dat)[-1]) {
  dat[i] <- gsub('(.*),', '\\1', dat[i])  # greedy .* matches everything up to the last comma
}
write(dat, 'temp.csv')

tmp <- read.table('temp.csv', header = TRUE, stringsAsFactors = FALSE, sep = ",")

> tmp[1:5,1:7]
  UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP SCUGFFP
1 100654        R    4496        R    1044        R      23
2 100663        R   10646        R    1496        R      14
3 100690        R     380        R       5        R       1
4 100706        R    6119        R     774        R      13
5 100724        R    4638        R    1209        R      26

The moral of the story: listen to Joshua Ulrich ;)

Quick fix: open the file in Excel and save it. This will also delete the extra separators.

Alternatively, read the file with one extra dummy column name so that the 452 fields line up, then drop the dummy column:

dat <- readLines(paste("sfa", '0910', ".csv", sep = ""), n = 1)
dum.names <- unlist(strsplit(dat, ','))    # the 451 real column names from the header
tmp <- read.table(paste("sfa", '0910', ".csv", sep = ""),
                  header = FALSE, stringsAsFactors = FALSE,
                  col.names = c(dum.names, 'XXXX'), sep = ",", skip = 1)
tmp1 <- tmp[, -dim(tmp)[2]]                # drop the dummy last column
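For completeness, the per-line gsub() loop above can be vectorized (a sketch, assuming the spare comma always sits at the end of each line, so it can be anchored with ",$"):

dat <- readLines("sfa0910.csv")
dat[-1] <- sub(",$", "", dat[-1])   # anchored at end of line; assumes the extra comma is trailing
writeLines(dat, "temp.csv")
tmp <- read.csv("temp.csv", stringsAsFactors = FALSE)

Because the pattern is anchored and sub() replaces at most one match per line, a legitimate comma elsewhere in a row is left alone.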
Elsie answered 16/8, 2012 at 1:32 Comment(1)
Good call! This was my exact problem as well.Willams

My tip: use count.fields() as a quick diagnostic when delimited files do not behave as expected.

First, tabulate the field counts per line with table():

table(count.fields("sfa0910.csv", sep = ","))
# 451  452 
#   1 6852

That tells you that all but one of the lines contains 452 fields. So which is the aberrant line?

which(count.fields("sfa0910.csv", sep = ",") != 452)
# [1] 1

The first line is the problem. On inspection, all lines except the first are terminated by 2 commas.
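One way to do that inspection without leaving R (a sketch; the 10-character window is arbitrary):

lines <- readLines("sfa0910.csv", n = 2)
substring(lines, nchar(lines) - 10)   # last few characters of the header and the first data row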

The question now is: what does that mean? Is there supposed to be an extra field in the header row which was omitted? Or were the 2 commas appended to the other lines in error? It may be best to contact whoever generated the data, if possible, to clarify the ambiguity.

Thinker answered 16/8, 2012 at 2:8 Comment(1)
+1 for highlighting count.fields. A nice function for this sort of processing by the looks of it.Samite

I know you've already found an answer, but since this question helped me figure out the following, I'll share it:

If you read into R a file with a different number of columns in different rows, like this:

1,2,3,4,5
1,2,3,4
1,2,3

it will be read in with the missing columns filled with NAs, like this:

1,2,3,4,5
1,2,3,4,NA
1,2,3,NA,NA

BUT! If the row with the most columns is not the first row (strictly, not among the first five lines, which read.table scans to determine the column count), like this:

1,2,3,4
1,2,3,4,5
1,2,3

then it is read in a rather confusing way:

1,2,3,4
1,2,3,4
5,NA,NA,NA
1,2,3,NA

(overwhelming before you figure out the problem and quite simple after!)
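A minimal reproduction (a sketch: the widest row is placed after the first five lines, because read.csv sizes its columns from those lines and so settles on four columns here):

## five 4-field lines, then a 5-field line, then a 3-field line
writeLines(c(rep("1,2,3,4", 5), "1,2,3,4,5", "1,2,3"), "ragged.csv")
read.csv("ragged.csv", header = FALSE)
## the fifth field of the long row spills onto a row of its own, padded with NAs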

Just hope it may help someone!

Foraminifer answered 16/4, 2017 at 17:57 Comment(1)
best answer I have found!!Sanbo

If you are using local data, also make sure the file is in the right place. To be safe, put it in your working directory, which you can set with setwd("C:/[User]/[MyFolder]") directly in your R console.
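A quick way to check where R is looking (a sketch; the path is the same placeholder as above):

getwd()                            # where R currently looks for files
setwd("C:/[User]/[MyFolder]")      # point R at the folder that holds the CSV
list.files(pattern = "\\.csv$")    # confirm the file is visible from here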

Natatory answered 28/6, 2021 at 14:26 Comment(0)
