R using fread colClasses or skip arguments to read csv with no column headers
Asked Answered
S

1

11

I would like to be able to skip a column that is read into R via data.table's fread function in v1.8.9. But the csv I am reading in, has no column headers…which appears to be a problem for fread... is there a way to just specify that I don't want specific columns?

Would it be better to just pre-allocate a column name and then let it read it in so that it can be skipped?

To give an example, I downloaded the data from the following URL

http://www.truefx.com/dev/data/2013/MAY-2013/AUDUSD-2013-05.zip

unzipped it…

and read the csv into R using fread and it has pretty much the same file name just with the csv extension.

system.time(pp <- fread("AUDUSD-2013-05.csv",sep=","))
  user  system elapsed 
16.427   0.257  16.682 

head(pp)
       V1                    V2      V3      V4
1: AUD/USD 20130501 00:00:04.728 1.03693 1.03721
2: AUD/USD 20130501 00:00:21.540 1.03695 1.03721
3: AUD/USD 20130501 00:00:33.789 1.03694 1.03721
4: AUD/USD 20130501 00:00:37.499 1.03692 1.03724
5: AUD/USD 20130501 00:00:37.524 1.03697 1.03719
6: AUD/USD 20130501 00:00:39.789 1.03697 1.03717

str(pp)
Classes ‘data.table’ and 'data.frame':  4060762 obs. of  4 variables:
$ V1: chr  "AUD/USD" "AUD/USD" "AUD/USD" "AUD/USD" ...
$ V2: chr  "20130501 00:00:04.728" "20130501 00:00:21.540" "20130501 00:00:33.789" "20130501 00:00:37.499" ...
$ V3: num  1.04 1.04 1.04 1.04 1.04 ...
$ V4: num  1.04 1.04 1.04 1.04 1.04 ...
- attr(*, ".internal.selfref")=<externalptr> 

I tried using the new(ish) colClasses or skip arguments to ignore the fact that the first column is all the same…and is unnecessary.

but doing:

pp1 <- fread("AUDUSD-2013-05.csv",sep=",",skip=1)

doesn't omit the reading in of the first column

and using colClasses leads to the following error

pp1 <- fread("AUDUSD-2013-05.csv",sep=",",colClasses=list(NULL,"character","numeric","numeric"))

Error in fread("AUDUSD-2013-05.csv", sep = ",", colClasses = list(NULL,  : 
 colClasses is type list but has no names

other attempts incude

pp1 <- fread("AUDUSD-2013-06.csv",sep=",", colClasses=c(V1=NULL,V2="character",V3="numeric",V4="numeric"))
str(pp1)
Classes ‘data.table’ and 'data.frame':  5524877 obs. of  4 variables:
 $ V1: chr  "AUD/USD" "AUD/USD" "AUD/USD" "AUD/USD" ...
 $ V2: chr  "20130603 00:00:00.290" "20130603 00:00:00.291" "20130603 00:00:00.292" "20130603 00:00:03.014" ...
 $ V3: num  0.962 0.962 0.962 0.962 0.962 ...
 $ V4: num  0.962 0.962 0.962 0.962 0.962 ...
 - attr(*, ".internal.selfref")=<externalptr>

i.e pretty much exactly the same as if I had not used colClasses...

Are there any suggestions to be able to speed up the reading in of data by omitting the first column?

Also perhaps a bit much to ask, but is it possible to directly read a zip file rather than unzipping it first and then reading in the csv?

Oh and if it wasn't clear I'm using data.table v1.8.9

Signboard answered 10/7, 2013 at 9:28 Comment(8)
R usually creates its own column names like you have above V1,V2 etc. So if you can read in the data like above than I'm not sure what the problem is? You could try to do colClasses = NULL without declaring that its a list. also you could fill in colClasses = c(col1,col2,col3,col4,col5)Phrenology
skip argument is about skipping rows not columns; I'm not convinced that had there been a col.skip argument, you'd get a noticeable speed improvement - you'd still have to read in those chars before deciding to abandon themBilow
@SanderVanderZeeuw I tried your method...but got the following error: pp <- fread("AUDUSD-2013-05.csv",sep=",", colClasses=c(NULL,"character","numeric","numeric")) Error in fread("AUDUSD-2013-05.csv", sep = ",", colClasses = c(NULL, "character", : colClasses is unnamed and length 3 but there are 4 columns. See ?data.table for more info about the colClasses argument.Signboard
@Signboard U must try colClasses=NULL and nothing else. R usually knows how to deal with certain columns. And 1 more question you dont need the entire CSV file thus only the last 3 columns?Phrenology
Yes I only want the last three columns....Signboard
@Signboard at least i know how to read in with full zip file takes on my dual-core laptop 10 sec to load the data. The reason you asked this question was because you need to speed up the reading right?? The following command works for me df<-fread(unzip('AUDUSD-2013-05.zip')) and than dfs <- df[2:4] rm(df) and youre memory will be sufficient.Phrenology
Yes it was to speed up the reading...but your solution doesn't speed up the actual reading from what I understand appart from just putting it all into a one line/function and then subsetting some columns after...What I am trying to do is NOT read the unnecessary first column to speed up the reading.Signboard
@Signboard what makes you think that not reading the first column will result in a (significant enough to care) speed up? You'd still have to read whatever is in the file to figure out where the column ends. What operation exactly do you think you won't do that will speed things up?Bilow
N
17

I think the argument you're looking for is drop. Try:

require(data.table)  # 1.9.2+
pp <- fread("AUDUSD-2013-05.csv", drop = 1)

Note that you can drop by name or position.

fread("AUDUSD-2013-05.csv", drop = c("columThree","anotherColumnName"))

fread("AUDUSD-2013-05.csv", drop = 10:15)  # read all columns other than 10:15

And you can select by name or position, too.

fread("AUDUSD-2013-05.csv", select = 10:15)  # read only columns 10:15

fread("AUDUSD-2013-05.csv", select = c("columnA","columnName2"))

These arguments were added to v1.9.2 (released to CRAN in Feb 2014) and are documented in ?fread. You'll need to upgrade to use them.

Nepos answered 11/7, 2014 at 4:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.