Reading Tab-Delimited Data into R
R

2

17

I am trying to read a large tab-delimited file into R.

First I tried this:

data <- read.table("data.csv", sep="\t")

But it reads some of the numeric variables in as factors.

So I tried to read in the data by specifying the type I want for each variable, like this:

data <- read.table("data.csv", sep="\t", colClasses=c("character","numeric","numeric","character","boolean","numeric"))

But when I try this it gives me an error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : scan() expected 'a real', got '"4"'

I think it might be that there are quotes around some of the numeric values in the original raw file, but I'm not sure.

Rubber answered 26/7, 2012 at 18:41 Comment(0)
W
8

Without seeing your data, it could be one of a few things: not all of the fields are separated by tabs; there are embedded tabs inside single observations; or a litany of others.

To sort this out, set options(stringsAsFactors=FALSE) and then rerun your first read.table call.

Check out str(data) and try to figure out which rows are the culprits. Some of the numeric values are read as factors because something in that column looks like a character to R, so it coerces the whole column to character. It usually takes some digging, but the problem is almost surely with your input file.
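
A minimal sketch of that digging, assuming a small made-up file (the temporary file, the column names id and value, and the stray "n/a" entry are all hypothetical). It reads every column as character and then flags the rows that fail numeric conversion:

```r
# Hypothetical sample: a tab-delimited file with one non-numeric entry
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "a\t1.5", "b\tn/a", "c\t3"), tmp)

# Read everything as character so nothing is converted to a factor
raw <- read.table(tmp, sep = "\t", header = TRUE, colClasses = "character")
str(raw)

# Values that are present but do not convert to numeric are the culprits
bad <- !is.na(raw$value) & is.na(suppressWarnings(as.numeric(raw$value)))
raw[bad, ]
```

On a real file, repeating the check for each column that should be numeric narrows down which rows need fixing.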

This is a common data munging issue, good luck!

Widower answered 26/7, 2012 at 18:46 Comment(5)
Thanks for the response. That helps, but the variables I want to import as numeric are imported as characters instead. When I try to convert one to a numeric variable, it gives me NAs for all the observations. I'll take a closer look at the data though to check out the suggestions you made. - Rubber
Oh, looking closer at your error, you've got a quoted four: "4". R has put an extra set of single quotes around it ('"4"'). This means that in your tsv file your numbers are quoted and thus treated as character. Add quote='"' to your read.table line and see how that works for you. - Widower
The problem is definitely that in my raw data file the values are enclosed in quotation marks, so it is reading the values as characters when they should be numeric. I tried the quote='"' you mentioned above but that does not fix the problem... The raw data file is also too big, so I cannot remove the quotes in a text editor or Excel without crashing the programs. - Rubber
If you're on Linux or Unix, you can use the command line tool sed. sed -i s/\"//g filename will remove all instances of ". But that might not be what you want... - Widower
Yep, I was able to remove the quotes using the command line. Then I could read in the data and convert it correctly to numeric. Thanks! - Rubber
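One likely reason quote='"' did not help: according to the read.table documentation, quoting is only considered for columns read as character, so a column requested as "numeric" in colClasses keeps its quotes and scan() rejects it. The sketch below reproduces this on a made-up two-column file (the file and the column names name and count are hypothetical). The asker's real file may have additional problems, since converting from character gave NAs there.

```r
# Hypothetical sample: numeric values wrapped in double quotes
tmp <- tempfile(fileext = ".tsv")
writeLines(c('name\tcount', 'x\t"4"', 'y\t"12"'), tmp)

# Requesting numeric directly fails: quotes are not stripped for numeric columns
res <- tryCatch(
  read.table(tmp, sep = "\t", header = TRUE,
             colClasses = c("character", "numeric")),
  error = function(e) conditionMessage(e)
)
res  # the scan() error message, captured as a string

# Reading as character lets the default quote handling strip the quotes
dat <- read.table(tmp, sep = "\t", header = TRUE, colClasses = "character")
dat$count <- as.numeric(dat$count)
str(dat)
```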
C
1
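# Simulate numbers that arrive wrapped in single quotes
x <- paste("'", floor(runif(10, 0, 10)), "'", sep = "")
x

 [1] "'7'" "'3'" "'0'" "'3'" "'9'" "'1'" "'4'" "'8'" "'5'" "'8'"

# Strip the quotes, then convert to numeric
as.numeric(gsub("'", "", x))

 [1] 7 3 0 3 9 1 4 8 5 8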
Canaster answered 26/7, 2012 at 21:37 Comment(0)
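
The same gsub() idea can be applied to whole columns once the file has been read as character. This is only a sketch: the data frame df and its columns id, n1 and n2 are invented for illustration, and the pattern strips both single and double quotes.

```r
# Hypothetical data frame whose numeric columns arrived as quoted strings
df <- data.frame(id = c("a", "b"),
                 n1 = c("'7'", "'3'"),
                 n2 = c('"1.5"', '"2"'),
                 stringsAsFactors = FALSE)

# Columns that should be numeric
num_cols <- c("n1", "n2")

# Remove quote characters from each column, then convert to numeric
df[num_cols] <- lapply(df[num_cols], function(col) {
  as.numeric(gsub("[\"']", "", col))
})
str(df)
```

This keeps the cleanup inside R, but it needs the whole file in memory, so for very large files a command line tool such as sed may still be the more practical route.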
