R data.table fread command : how to read large files with irregular separators?
Asked Answered
I

5

4

I have to work with a collection of 120 files of ~2 GB (525600 lines x 302 columns). The goal is to make some statistics and put the results in a clean SQLite database.

Everything works fine when my script import with read.table(), but it's slow. So I've tried with fread, from the data.table package (version 1.9.2), but it give me this error :

Error in fread(txt, header = T, select = c("YYY", "MM", "DD",  : 
Not positioned correctly after testing format of header row. ch=' '

The first 2 lines and 7 rows of my data look like that :

 YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00

So, there is a first space at beginning, then only one space between date columns, then an arbitrary number of spaces between the others columns.

I've tried to use a command like this to convert spaces in comma :

DT <- fread(
            paste("sed 's/\\s\\+/,/g'", txt),
            header=T,
            select=c('HHHH','MM','DD','HH')
)

without success : the problem remains and it seems to be slow with the sed command.

Fread doesn't seems to like "arbitrary number of space" as separator or empty column at beginning. Any idea ?

Here is a (maybe) smallest reproducible example (newline char after 40790) :

txt<-print(" YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00")

testDT<-fread(txt,
              header=T,
              select=c("YYY","MM","DD","HH")
)

Thanks for your help !

UPDATE : - The error doesn't occurs with data.table 1.8.* . With this version, the table is read as one unique line, which is not better.

UPDATE 2 - As mentioned in comments, I could use sed to format the table and then read it with fread. I've put a script in an answer above where I create a sample dataset and then, compare some system.time ().

Idalia answered 6/3, 2014 at 15:39 Comment(5)
+1 Nice example. I've added a link back here to the fread todo list at the top of fread.c to tackle soon hopefully. sed in the meantime then.Marcellemarcellina
Btw, whitespace delimiting is generally considered less robust than a character delimiter like , ; or |. Avoid whitespace delimiting if at all possible. I'm assuming you can't because the file has been provided to you.Marcellemarcellina
Thanks for your reply and the link back in the todo ! For the files, yes : it's some output of an obscure hydrological stochastic model.Idalia
Try running the sed as a separate step to creat a modified cleaned up file; then fread that; then see if that is any faster.Sanderlin
Clayton Stanley, I think that NeronLeVelu suggested the same approach. I've written a small function that take a sed command in argument (or use default command) to write formatted data into a temp file, then read it with fread().Idalia
T
4
sed 's/^[[:blank:]]*//;s/[[:blank:]]\{1,\}/,/g' 

for you sed

it's not possible to collect all result of fread into 1 (temporary) file (adding the source reference) and treat this file with sed (or other tool) to avoid a fork of the tools at every iteration ?

Thin answered 6/3, 2014 at 15:50 Comment(3)
Thanks for the sed. I don't know why, but when I use your command on this line " YYYY MM TEST2", I've got that : ",Y,Y,Y,Y,M,M,T,E,S,T,2,". At home I'm on mac os 10.9, there is maybe some incompatibilities ! But.. i've also tried with (homebrew) gnu-sed. Same thing. Your code make perfect sens : we replace any number of blank at the start of string by nothing (this part works alone) and we replace any number of blank by a comma. I've also tried on vim : same thing.. And also on a centOS server. I don't understand :) Maybe i'm too tired..Idalia
oops, normal, i put * instead of \{1,\} in second action, sorry, I make the correction in replyThin
I've just read your suggestion again and I saw that I've made the opposite in my code (update 2) : sed first then fread()... I've no idea how to do it in your way :/ What do you mean by "fork of the tools" ? Thanks again for the sed command !Idalia
S
5

Just committed to devel, v1.9.5. fread() gains strip.white argument with default TRUE (as opposed to base::read.table(), because it's more desirable). The example data is now added to tests.

With this recent commit:

require(data.table) # v1.9.5, commit 0e7a835 or more recent
ans <- fread(" YYYY MM DD HH mm             19490             40790\n   1991 10  1  1  0      1.046465E+00      1.568405E+00")
#      V1 V2 V3 V4 V5           V6           V7
# 1: YYYY MM DD HH mm 19490.000000 40790.000000
# 2: 1991 10  1  1  0     1.046465     1.568405
sapply(ans, class)
#          V1          V2          V3          V4          V5          V6          V7 
# "character" "character" "character" "character" "character"   "numeric"   "numeric" 
Synder answered 16/9, 2015 at 0:42 Comment(0)
T
4
sed 's/^[[:blank:]]*//;s/[[:blank:]]\{1,\}/,/g' 

for you sed

it's not possible to collect all result of fread into 1 (temporary) file (adding the source reference) and treat this file with sed (or other tool) to avoid a fork of the tools at every iteration ?

Thin answered 6/3, 2014 at 15:50 Comment(3)
Thanks for the sed. I don't know why, but when I use your command on this line " YYYY MM TEST2", I've got that : ",Y,Y,Y,Y,M,M,T,E,S,T,2,". At home I'm on mac os 10.9, there is maybe some incompatibilities ! But.. i've also tried with (homebrew) gnu-sed. Same thing. Your code make perfect sens : we replace any number of blank at the start of string by nothing (this part works alone) and we replace any number of blank by a comma. I've also tried on vim : same thing.. And also on a centOS server. I don't understand :) Maybe i'm too tired..Idalia
oops, normal, i put * instead of \{1,\} in second action, sorry, I make the correction in replyThin
I've just read your suggestion again and I saw that I've made the opposite in my code (update 2) : sed first then fread()... I've no idea how to do it in your way :/ What do you mean by "fork of the tools" ? Thanks again for the sed command !Idalia
I
4

With the answers of NeronLeVelu and Clayton Stanlay, I completed the answer with a custom function, example data and some system.time() for comparison purpose. Those test were made on Mac os 10.9 and R 3.0.2. However, I've made the same test on a linux machine and the sed command was really slow to perform, compared with read.table() with nrows and colClasses pre-calculated. The fread part was really fast, about 5 seconds for 5e6 rows on both system.

library(data.table)


# create path to new temporary file
origData <- tempfile(pattern="origData",fileext=".txt")
# write table with irregular blank spaces separators.
write(paste0(" YYYY MM DD HH mm             19490             40790","\n",
                 paste(rep(" 1991 10  1  1  0      1.046465E+00      1.568405E+00", 5e6), 
                       collapse="\n"),"\n"),
      file=origData
)

# define column classes for read.table() optimization
colClasses <- c(rep('integer',5),rep('numeric',2))

# Function to count rows with command wc for read.table() optimization.
fileRowsCount <- function(file){
    if(file.exists(file)){
            sysCmd <- paste("wc -l", file)
            rowCount <- system(sysCmd, intern=T)
            rowCount <- sub('^\\s', '', rowCount)
        as.numeric(
                       strsplit(rowCount, '\\s')[[1]][1]
                      )
    }
}

# Function to sed data into temp file before importing with sed
sedFread<-function(file, sedCmd=NULL, ...){
    require(data.table)
    if(is.null(sedCmd)){
        #default : sed for convert blank separated table to csv. Thanks NeronLevelu !
        sedCmd <- "'s/^[[:blank:]]*//;s/[[:blank:]]\\{1,\\}/,/g'"
    }
    #sed into temp file
    tmpPath<-tempfile(pattern='tmp',fileext='.txt')
    sysCmd<-paste('sed',sedCmd, file, '>',tmpPath)
    try(system(sysCmd))
    DT<-fread(tmpPath,...)
    try(system(paste('rm',tmpPath)))
    return(DT)
}

Mac OS results :

# First sed into temp file and then fread.
system.time(
DT<-sedFread(origData, header=TRUE)
)
> user  system elapsed
> 23.847   0.628  24.514

# Sed directly in fread command :
system.time(
DT <- fread(paste("sed 's/^[[:blank:]]*//;s/[[:blank:]]\\{1,\\}/,/g'", origData),
            header=T)
)
> user  system elapsed
> 23.606   0.515  24.219


# read.table without nrows and colclasses
system.time(
DF<-read.table(origData, header=TRUE)
)
> user  system elapsed
> 38.053   0.512  38.565

# read.table with nrows an colclasses
system.time(
DF<-read.table(origData, header=TRUE, nrows=fileRowsCount(origData), colClasses=colClasses)
)
> user  system elapsed
> 33.813   0.309  34.125

Linux results :

# First sed into temp file and then fread.
system.time(
  DT<-sedFread(origData, header=TRUE)
)
> Read 5000000 rows and 7 (of 7) columns from 0.186 GB file in 00:00:05
> user  system elapsed 
> 47.055   0.724  47.789 

# Sed directly in fread command :
system.time(
DT <- fread(paste("sed 's/^[[:blank:]]*//;s/[[:blank:]]\\{1,\\}/,/g'", origData),
            header=T)
)
> Read 5000000 rows and 7 (of 7) columns from 0.186 GB file in 00:00:05
> user  system elapsed 
> 46.088   0.532  46.623 

# read.table without nrows and colclasses
system.time(
DF<-read.table(origData, header=TRUE)
)
> user  system elapsed 
> 32.478   0.436  32.912 

# read.table with nrows an colclasses
system.time(
DF<-read.table(origData,
               header=TRUE, 
               nrows=fileRowsCount(origData),
               colClasses=colClasses)
 )
> user  system elapsed 
> 21.665   0.524  22.192 

# Control if DT and DF are identical : 
setnames(DT, old=names(DT), new=names(DF))
identical(as.data.frame(DT), DF)                                                              
>[1] TRUE

In fine : the method I used in first place was the most efficient, in this case.

Thanks to NeronLeVelu, Matt Dowle and Clayton Stanley!

Idalia answered 10/3, 2014 at 11:1 Comment(0)
I
2

I've found another way to do it, much faster, with awk instead of sed. Here is another example :

library(data.table)

# create path to new temporary file
origData <- tempfile(pattern="origData",fileext=".txt")

# write table with irregular blank spaces separators.
write(paste0(" YYYY MM DD HH mm             19490             40790","\n",
            paste(rep(" 1991 10  1  1  0      1.046465E+00      1.568405E+00", 5e6),
            collapse="\n"),"\n"),
            file=origData
  )


# function awkFread : first awk, then fread. Argument : colNums = selection of columns. 
awkFread<-function(file, colNums, ...){
        require(data.table)
        if(is.vector(colNums)){
            tmpPath<-tempfile(pattern='tmp',fileext='.txt')
            colGen<-paste0("$",colNums,"\",\"", collapse=",")
            colGen<-substr(colGen,1,nchar(colGen)-3)
            cmdAwk<-paste("awk '{print",colGen,"}'", file, '>', tmpPath)
            try(system(cmdAwk))
            DT<-fread(tmpPath,...)
            try(system(paste('rm', tmpPath)))
            return(DT)
        }
}

# check read time :
system.time(
            DT3 <- awkFread(origData,c(1:5),header=T)
            )

> user  system elapsed 
> 6.230   0.408   6.644
Idalia answered 10/3, 2014 at 16:18 Comment(0)
L
1

If peak memory is not an issue or you can stream it in chunks that are manageable, the following gsub()/fread() hybrid should work, converting all continuous space characters to a single delimiter of your choosing (e.g. "\t"), prior to parsing by fread():

fread_blank = function(inputFile, spaceReplace = "\t", n = -1, ...){
  fread(
    input = paste0(
      gsub(pattern = "[[:space:]]+",
           replacement = spaceReplace,
           x = readLines(inputFile, n = n)),
      collapse = "\n"),
    ...)
}

I must agree with others that space-delimited files is not ideal choice, but I come across them pretty often whether I like it or not.

Lenette answered 26/1, 2016 at 0:22 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.