I would like to be able to skip a column that is read into R via data.table
's fread
function in v1.8.9. But the csv I am reading in, has no column headers…which appears to be a problem for fread... is there a way to just specify that I don't want specific columns?
Would it be better to just pre-allocate a column name and then let it read it in so that it can be skipped?
To give an example, I downloaded the data from the following URL
http://www.truefx.com/dev/data/2013/MAY-2013/AUDUSD-2013-05.zip
unzipped it…
and read the csv into R using fread and it has pretty much the same file name just with the csv extension.
system.time(pp <- fread("AUDUSD-2013-05.csv",sep=","))
user system elapsed
16.427 0.257 16.682
head(pp)
V1 V2 V3 V4
1: AUD/USD 20130501 00:00:04.728 1.03693 1.03721
2: AUD/USD 20130501 00:00:21.540 1.03695 1.03721
3: AUD/USD 20130501 00:00:33.789 1.03694 1.03721
4: AUD/USD 20130501 00:00:37.499 1.03692 1.03724
5: AUD/USD 20130501 00:00:37.524 1.03697 1.03719
6: AUD/USD 20130501 00:00:39.789 1.03697 1.03717
str(pp)
Classes ‘data.table’ and 'data.frame': 4060762 obs. of 4 variables:
$ V1: chr "AUD/USD" "AUD/USD" "AUD/USD" "AUD/USD" ...
$ V2: chr "20130501 00:00:04.728" "20130501 00:00:21.540" "20130501 00:00:33.789" "20130501 00:00:37.499" ...
$ V3: num 1.04 1.04 1.04 1.04 1.04 ...
$ V4: num 1.04 1.04 1.04 1.04 1.04 ...
- attr(*, ".internal.selfref")=<externalptr>
I tried using the new(ish) colClasses or skip arguments to ignore the fact that the first column is all the same…and is unnecessary.
but doing:
pp1 <- fread("AUDUSD-2013-05.csv",sep=",",skip=1)
doesn't omit the reading in of the first column
and using colClasses leads to the following error
pp1 <- fread("AUDUSD-2013-05.csv",sep=",",colClasses=list(NULL,"character","numeric","numeric"))
Error in fread("AUDUSD-2013-05.csv", sep = ",", colClasses = list(NULL, :
colClasses is type list but has no names
other attempts incude
pp1 <- fread("AUDUSD-2013-06.csv",sep=",", colClasses=c(V1=NULL,V2="character",V3="numeric",V4="numeric"))
str(pp1)
Classes ‘data.table’ and 'data.frame': 5524877 obs. of 4 variables:
$ V1: chr "AUD/USD" "AUD/USD" "AUD/USD" "AUD/USD" ...
$ V2: chr "20130603 00:00:00.290" "20130603 00:00:00.291" "20130603 00:00:00.292" "20130603 00:00:03.014" ...
$ V3: num 0.962 0.962 0.962 0.962 0.962 ...
$ V4: num 0.962 0.962 0.962 0.962 0.962 ...
- attr(*, ".internal.selfref")=<externalptr>
i.e pretty much exactly the same as if I had not used colClasses...
Are there any suggestions to be able to speed up the reading in of data by omitting the first column?
Also perhaps a bit much to ask, but is it possible to directly read a zip file rather than unzipping it first and then reading in the csv?
Oh and if it wasn't clear I'm using data.table v1.8.9
colClasses = NULL
without declaring that its a list. also you could fill incolClasses = c(col1,col2,col3,col4,col5)
– Phrenologyskip
argument is about skipping rows not columns; I'm not convinced that had there been acol.skip
argument, you'd get a noticeable speed improvement - you'd still have to read in those chars before deciding to abandon them – Bilowpp <- fread("AUDUSD-2013-05.csv",sep=",", colClasses=c(NULL,"character","numeric","numeric"))
Error in fread("AUDUSD-2013-05.csv", sep = ",", colClasses = c(NULL, "character", : colClasses is unnamed and length 3 but there are 4 columns. See ?data.table for more info about the colClasses argument.
– Signboarddf<-fread(unzip('AUDUSD-2013-05.zip'))
and thandfs <- df[2:4]
rm(df)
and youre memory will be sufficient. – Phrenology