fread together with grepl

I have a large data set (125,000 rows, ~20 MB) from which rows containing a certain string need to be deleted and some columns need to be selected while the file is being read.

First, I discovered that grepl does not work properly because fread reads the data as a single column, as also noted in this question.

The example data can be found here (following @akrun's advice), and the head of the data looks like this:

head(sum_data)

TRIAL :            1        3331        9091
  TRIAL :            2  1384786531   278055555
    2     0.10     0.000E+00 -0.0047 -0.0168 -0.9938    -0.0087 -0.0105 -0.9709     0.0035  0.0079 -0.9754     0.0081  0.0023  0.9997      -0.135324E-09    0.278754E-01
    2     0.20     0.000E+00 -0.0121  0.0002 -0.9898    -0.0364 -0.0027 -0.9925    -0.0242 -0.0050 -0.9929     0.0029 -0.0023  0.9998      -0.133521E-09    0.425567E-01
    2     0.30     0.000E+00  0.0193 -0.0068 -0.9884     0.0040  0.0139 -0.9782    -0.0158  0.0150 -0.9814     0.0054 -0.0008  0.9997      -0.134103E-09    0.255356E-01
    2     0.40     0.000E+00 -0.0157  0.0183 -0.9879    -0.0315 -0.0311 -0.9908    -0.0314 -0.0160 -0.9929     0.0040  0.0010  0.9998      -0.134819E-09    0.257300E-01
    2     0.50     0.000E+00 -0.0402  0.0300 -0.9832    -0.0093  0.0269 -0.9781    -0.0326  0.0247 -0.9802     0.0044 -0.0010  0.9997      -0.131515E-09    0.440350E-01

I attempted to read the data with fread and used grepl to remove the rows:

files <-dir(pattern = "*sum.txt",full.names = FALSE)
library(data.table)

fread_files <- function(files){
sum_data_read <- fread(files,skip=2, sep="\t", ) #seperation is tab.
df_grep <- sum_vgm_read [!grepl("TRI",sum_vgm_read$V1),] # for removing the lines that contain "TRIAL" letter in V1 column. But so far there is no V1 column is recognized!!

df <- bind_rows(df_grep)  #binding rows after removing 
write.table(as.data.table(df),file = gsub("(.*)(\\..*)", "\\1_new\\2", files),row.names = FALSE,col.names = TRUE) 
}

and finally called it with lapply:

lapply(files, fread_files)

When I run this, only one row of data is written to the output. Something is going wrong, but I don't know what. Thanks in advance for any help!

Contagium answered 28/3, 2016 at 5:46 Comment(3)
Do you just want to read the file, delete rows, and rewrite the files? Or do you want a data.table or data.frame for manipulation? - Tattan
@Titolondon thanks for asking. I want to write a new file rather than overwrite the originals, and I want a data.frame with column names and faster reading, since I have many files. - Contagium
Did you try my answer below? It seems to do what you want: 1. read the file, 2. remove rows, 3. write a new file without the "TRIAL" lines. What is missing? By the way, I do not see column names in your example data. What column names do you want? - Tattan
E
16

First, I discovered that grepl does not work properly because fread reads the data as a single column, as also noted in this question.

But that question's accepted answer says that problem was fixed in v1.9.6. Which version are you using? That's why we ask you to please state the version number up front, to save time answering.

It is a great example file and the question is great.

I would not try to reinvent the wheel as operations like these have long been implemented as command line tools, which you can use together with fread directly. The advantage is that you won't churn through R memory, you can leave the filtering to the command tool and that can be much more efficient. For example, if you load all the lines as lines into R, those strings will be cached in R's global string cache (at least temporarily). Doing that filter outside R first will save that cost.

I downloaded your great file and tested the following which works.

> fread("grep -v TRIAL sum_data.txt")
         V1   V2 V3      V4      V5      V6      V7      V8      V9     V10     V11     V12    V13     V14    V15          V16       V17
     1:   2  0.1  0 -0.0047 -0.0168 -0.9938 -0.0087 -0.0105 -0.9709  0.0035  0.0079 -0.9754 0.0081  0.0023 0.9997 -1.35324e-10 0.0278754
     2:   2  0.2  0 -0.0121  0.0002 -0.9898 -0.0364 -0.0027 -0.9925 -0.0242 -0.0050 -0.9929 0.0029 -0.0023 0.9998 -1.33521e-10 0.0425567
     3:   2  0.3  0  0.0193 -0.0068 -0.9884  0.0040  0.0139 -0.9782 -0.0158  0.0150 -0.9814 0.0054 -0.0008 0.9997 -1.34103e-10 0.0255356
     4:   2  0.4  0 -0.0157  0.0183 -0.9879 -0.0315 -0.0311 -0.9908 -0.0314 -0.0160 -0.9929 0.0040  0.0010 0.9998 -1.34819e-10 0.0257300
     5:   2  0.5  0 -0.0402  0.0300 -0.9832 -0.0093  0.0269 -0.9781 -0.0326  0.0247 -0.9802 0.0044 -0.0010 0.9997 -1.31515e-10 0.0440350
    ---
124247: 250 49.5  0 -0.0040  0.0141  0.9802 -0.0152  0.0203 -0.9877 -0.0015  0.0123 -0.9901 0.0069  0.0003 0.9997 -1.30220e-10 0.0213215
124248: 250 49.6  0 -0.0006  0.0284  0.9819  0.0021  0.0248 -0.9920  0.0264  0.0408 -0.9919 0.0028 -0.0028 0.9997 -1.30295e-10 0.0284142
124249: 250 49.7  0  0.0378  0.0305  0.9779 -0.0261  0.0232 -0.9897 -0.0236  0.0137 -0.9928 0.0102 -0.0023 0.9997 -1.29890e-10 0.0410760
124250: 250 49.8  0  0.0569 -0.0203  0.9800 -0.0028 -0.0009 -0.9906 -0.0139 -0.0169 -0.9918 0.0039 -0.0017 0.9997 -1.31555e-10 0.0513482
124251: 250 49.9  0  0.0234 -0.0358  0.9840 -0.0340  0.0114 -0.9873 -0.0255  0.0134 -0.9888 0.0006  0.0009 0.9997 -1.30862e-10 0.0334976
>

The -v makes grep return all lines except lines containing the string TRIAL. Given the number of high quality engineers that have looked at the command tool grep over the years, it is most likely that it is as fast as you can get, as well as being correct, convenient, well documented online, easy to learn and search for solutions for specific tasks. If you need to do more complicated string filters (e.g. strings at the beginning or the end of the lines, etc) then grep syntax is very powerful. Learning its syntax is a transferable skill to other languages and environments.
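
As a small self-contained illustration of the call above, the following sketch builds a made-up demo file (not the asker's data) and reads it through grep -v. It assumes that grep is on the PATH and that the installed data.table is recent enough to offer the explicit cmd= argument; older versions take the command string as the first argument, as shown above. The column numbers passed to select= are arbitrary.

```r
library(data.table)

# Hypothetical demo file mimicking the layout in the question:
# "TRIAL" lines mixed with whitespace-separated numeric rows.
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "TRIAL :            1        3331        9091",
  "    2     0.10     0.000E+00 -0.0047",
  "    2     0.20     0.000E+00 -0.0121",
  "  TRIAL :            2  1384786531   278055555",
  "    3     0.10     0.000E+00  0.0193"
), demo_file)

# grep -v drops every line containing "TRIAL"; fread parses what is left.
# shQuote() protects paths that contain spaces.
# select= keeps only the 1st, 2nd and 4th columns (arbitrary choice).
dt <- fread(cmd = paste("grep -v TRIAL", shQuote(demo_file)),
            select = c(1, 2, 4))
print(dt)
```

Because the filtering happens before R sees the text, the dropped lines never enter R's memory.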

For further examples on the use of command line tools in fread, you may check the article Convenience features of fread. Please note that "On Windows we recommend Cygwin (run one .exe to install) which includes the command line tools such as grep".
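
Where grep is not available on the PATH (for example on Windows without Cygwin), the same filter can be written in base R, at the cost of loading the lines into R first. The sketch below applies it to several files and writes one new file per input. The helper name filter_trial_file, the demo file names, and the "_new" suffix placement are illustrative assumptions, not part of the answer above.

```r
# Hypothetical helper: drop lines containing "TRIAL" from one file and
# write the remaining lines to a sibling file with "_new" before the extension.
filter_trial_file <- function(path) {
  lines <- readLines(path)
  kept <- lines[!grepl("TRIAL", lines, fixed = TRUE)]
  out_path <- sub("(\\.[^.]*)$", "_new\\1", path)
  writeLines(kept, out_path)
  invisible(out_path)
}

# Demo: two made-up input files in a temporary directory.
demo_dir <- tempfile("trial_demo")
dir.create(demo_dir)
for (name in c("a_sum.txt", "b_sum.txt")) {
  writeLines(c("TRIAL : 1 3331 9091", "2 0.10 0.000E+00", "2 0.20 0.000E+00"),
             file.path(demo_dir, name))
}

# list.files() takes a regular expression, so the dot is escaped and the
# pattern is anchored at the end of the name.
files <- list.files(demo_dir, pattern = "sum\\.txt$", full.names = TRUE)
new_files <- vapply(files, filter_trial_file, character(1))
```

Anchoring the pattern with $ also keeps the generated *_new.txt files from being picked up if the script is run a second time.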

Estabrook answered 28/3, 2016 at 20:24 Comment(12)
Your solution is elegant, and thanks for the appreciation of my question. However, when I tested fread("grep -v TRIAL sum_data.txt"), it says 'grep' is not recognized as an internal or external command, operable program or batch file. In addition: Warning messages: 1: running command 'C:\Windows\system32\cmd.exe /c (grep -v TRIAL sum_data.txt) - Contagium
I am using data.table version 1.9.6. - Contagium
@Contagium On Windows, installing Cygwin should do the trick. - Estabrook
@Contagium You can use the select= parameter of fread to select columns by name or by number. See ?fread for all the flexible parameters; e.g. fread("grep -v TRIAL sum_data.txt", select=c(1,7,10)). - Estabrook
Thanks for your prompt response. So far I am having trouble installing Cygwin, but I hope it will be solved soon. Thanks for your answer and time! - Contagium
@Contagium A quick search online should resolve any problems. Good luck. A great many people in corporate environments use it. - Estabrook
One more thing: what if I had a list of, say, 20 files? When I replace sum_data.txt with files as in my question, I get an error: grep: sumavgm: No such file or directory. For a single file the code works perfectly. - Contagium
@Contagium You can either lapply fread over the files and rbindlist the result like this, or use command tools like this. - Estabrook
I checked both links. Both are about reading with fread, and as far as I can see they only give solutions for fast reading. However, I need to write a new file after reading the list of files, which they don't cover. How can I add a write step inside rbindlist(lapply(dat.files, function(f) { read.delim(gzfile(f)) }))? If possible, can you add this to your answer? - Contagium
@Contagium I'm not quite following. Please could you ask a new question - that would get alternative new answers too and more eyes on it. - Estabrook
I see. But it would be great if you added it to your solution, since it's included in my question. - Contagium
@Contagium Utterly confused now. If you're not reading but writing, you don't even need R for that. Just direct the output of grep to the output file. Please ask a new, clear question. - Estabrook
T
1

To read a file and remove rows based on a string criterion, you can use the readLines function and filter the result.

I use the stringr package for string manipulation.

library(stringr)
# Read your file by lines
DT <- readLines("sum_data") 
length(DT)
#> [1] 124501
# detect which lines contains trial
trial_lines <- str_detect(DT, "TRI")
head(trial_lines)
#> [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
# Remove those lines 
DT <- DT[!trial_lines]
length(DT)
#> [1] 124251
# Rewrite your file by line
writeLines(DT, "new_file")

If you have performance issues, you could try read_lines from the readr package instead of base readLines.
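
To keep only some columns after dropping the TRIAL lines, the filtered lines can be parsed into a data frame before writing. The following is a minimal base R sketch on made-up lines; the chosen columns (V1, V4, V5) and the output file are placeholders, and grepl with fixed = TRUE stands in for str_detect so the example needs no extra packages.

```r
# Made-up lines standing in for the result of readLines() on the real file.
lines <- c(
  "TRIAL :            1        3331        9091",
  "    2     0.10     0.000E+00 -0.0047 -0.0168",
  "    2     0.20     0.000E+00 -0.0121  0.0002"
)

# Drop every line that contains the literal string "TRIAL".
kept <- lines[!grepl("TRIAL", lines, fixed = TRUE)]

# read.table() splits on runs of whitespace by default and, without a
# header, names the columns V1, V2, ...
df <- read.table(text = kept, header = FALSE)

# Keep only the columns of interest (placeholder choice).
selected <- df[, c("V1", "V4", "V5")]

# Write the selected columns to a new file.
out_file <- tempfile(fileext = ".txt")
write.table(selected, out_file, row.names = FALSE, quote = FALSE)
```

For the real file, the lines vector would come from readLines as in the code above, and the column vector would list whichever columns are needed.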

Tattan answered 28/3, 2016 at 9:8 Comment(1)
I tried your script and it works! However, how can I select specific columns after deleting the TRIAL lines, say V1, V7 and V10, when writing the lines? - Contagium
