I have to work with a collection of 120 files of ~2 GB each (525600 lines x 302 columns). The goal is to compute some statistics and put the results in a clean SQLite database.
Everything works fine when my script imports the data with read.table(), but it's slow. So I tried fread from the data.table package (version 1.9.2), but it gives me this error:
Error in fread(txt, header = T, select = c("YYY", "MM", "DD", :
Not positioned correctly after testing format of header row. ch=' '
The first 2 rows and 7 columns of my data look like this:
YYYY MM DD HH mm 19490 40790
1991 10 1 1 0 1.046465E+00 1.568405E+00
So there is a single space at the beginning of each line, then exactly one space between the date columns, then an arbitrary number of spaces between the other columns.
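To make that layout concrete, here is a minimal base R sketch of the normalisation I am after. The two sample lines are typed by hand and the exact number of spaces in them is an assumption. The leading space is trimmed first, because otherwise it would turn into an empty first field once whitespace is replaced by commas.

```r
# Two sample lines mimicking the layout: a leading space, single spaces between
# the date columns, several spaces before the other columns (spacing assumed).
raw <- c(
  " YYYY MM DD HH mm       19490       40790",
  " 1991 10 1 1 0  1.046465E+00  1.568405E+00"
)

# Drop leading/trailing blanks, then turn each run of whitespace into one comma.
clean <- gsub("\\s+", ",", trimws(raw))

# Parse the normalised text; check.names = FALSE keeps the numeric column names.
df <- read.csv(text = paste(clean, collapse = "\n"), check.names = FALSE)
str(df)
```

This only illustrates the target format on a tiny sample; it is not meant as a fast way to read the 2 GB files.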
I've tried a command like this to convert the spaces into commas:
DT <- fread(
paste("sed 's/\\s\\+/,/g'", txt),
header=T,
select=c('HHHH','MM','DD','HH')
)
without success: the problem remains, and it seems to be slow with the sed command.
fread doesn't seem to like an "arbitrary number of spaces" as a separator, or an empty column at the beginning. Any ideas?
Here is a (hopefully) minimal reproducible example (there is a newline character after 40790):
txt<-print(" YYYY MM DD HH mm 19490 40790
1991 10 1 1 0 1.046465E+00 1.568405E+00")
testDT<-fread(txt,
header=T,
select=c("YYY","MM","DD","HH")
)
Thanks for your help!
UPDATE: The error doesn't occur with data.table 1.8.*. With that version, the whole table is read as a single line, which is no better.
UPDATE 2: As mentioned in the comments, I could use sed to format the table and then read it with fread. I've put a script in an answer above where I create a sample dataset and then compare some system.time() results.
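For reference, a rough sketch of that kind of comparison, under several assumptions: the sample size `n` and the temporary file are made up, the sed expressions are just one way to strip the leading space and collapse runs of spaces, and the `cmd =` argument of fread is only available in more recent data.table releases than the 1.9.2 I mentioned. The fread part is skipped if data.table or sed is not installed.

```r
# Build a small sample file in the same layout (random values, for timing only).
set.seed(1)
n <- 1000L                                  # hypothetical; the real files have 525600 rows
f <- tempfile(fileext = ".txt")
body <- sprintf(" 1991 10 1 %d %d  %e  %e",
                (seq_len(n) %/% 60L) %% 24L, seq_len(n) %% 60L,
                runif(n), runif(n))
writeLines(c(" YYYY MM DD HH mm       19490       40790", body), f)

# Baseline: read.table() splits on any run of whitespace by default (sep = "").
t_base <- system.time(
  df <- read.table(f, header = TRUE, check.names = FALSE)
)
cat("read.table:", t_base[["elapsed"]], "s\n")

# Optional: pre-process with sed, then read the comma-separated stream with fread.
if (requireNamespace("data.table", quietly = TRUE) && nzchar(Sys.which("sed"))) {
  # Strip leading spaces, then replace each run of spaces with a single comma.
  cmd <- paste("sed -e 's/^ *//' -e 's/  */,/g'", shQuote(f))
  t_fread <- system.time(dt <- data.table::fread(cmd = cmd))
  cat("sed + fread:", t_fread[["elapsed"]], "s\n")
}
```

On a file this small the timings mean little; increasing `n` towards the real row count should give a more useful picture.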