How to read only lines that fulfil a condition from a csv into R?

I am trying to read a large csv file into R. I only want to read and work with some of the rows that fulfil a particular condition (e.g. Variable2 >= 3). The resulting subset is a much smaller dataset.

I want to read these lines directly into a dataframe, rather than load the whole dataset into a dataframe and then select according to the condition, since the whole dataset does not easily fit into memory.

Certificate answered 21/4, 2014 at 11:38 Comment(4)
How 'large' is your file? Can't you read 1000 lines at a time and subset them according to your condition in each loop iteration?Mauser
Yes, I tried that. I was reading 1 million lines at a time. Each iteration was taking about 15 seconds, including adding the resulting "filtered" dataset to an existing dataframe. But given the size of the dataset I am dealing with, this solution would have taken more than 1 hour. As I wrote below, the solution I actually used (sqldf) took slightly less than one hour. Dirk's suggestion of awk for my problem would have taken about 2 hours. I will look into Python to accelerate some of these tasks. If someone has good pointers, let me know. Thanks everyone.Certificate
Thanks. Nice to know it worked better with sqldf. Definitely something to retain.Mauser
This is a very common question, but the 'best' answer for each use-case depends on what sort of condition, is it simple number/string-matching on one or more fields (use grep/awk), or does it require evaluating multiple fields (e.g. V2*V3 < mean(V4) & !is.na(V5))? If a simple grep/awk gets 90+% of the coarse size reduction done, it's one good way to go.Braunstein

You could use the read.csv.sql function in the sqldf package and filter using SQL select. From the help page of read.csv.sql:

library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv", 
    sql = "select * from file where `Sepal.Length` > 5", eol = "\n")
Mountain answered 21/4, 2014 at 12:41 Comment(4)
How does this solve the "file too big for current memory" problem?Shocking
The sqlite database used under the hood is by default a temporary file, hence no memory problem, I suppose.Mountain
This was the fastest way to solve my problem in R. It took about 1 hour. Thanks!Certificate
At least in my version of R (3.4.2) running on Linux, the above example did not work without the addition of backticks around Sepal.Length, i.e. I needed to use `Sepal.Length`.Callery

By far the easiest (in my book) is to use pre-processing.

R> DF <- data.frame(n=1:26, l=LETTERS)
R> write.csv(DF, file="/tmp/data.csv", row.names=FALSE)
R> read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($1 > 20) print $0}' /tmp/data.csv"),
+           header=FALSE)
  V1 V2
1 21  U
2 22  V
3 23  W
4 24  X
5 25  Y
6 26  Z
R> 

Here we use awk. We tell awk to use a comma as the field separator, and then use the condition 'if the first field is greater than 20' to decide whether to print the whole line (via $0).

The output from that command can be read by R via pipe().

This is going to be faster and more memory-efficient than reading everything into R.
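
Because the header line is filtered out (the output above has no header row), the columns come back as V1, V2, and so on. One way to restore the original column names (a sketch reusing the same /tmp/data.csv) is to read just the top of the file separately:

# Filter with awk as above, then re-attach the original column names,
# which are read cheaply from the first line of the file.
dat <- read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($1 > 20) print $0}' /tmp/data.csv"),
                header = FALSE)
names(dat) <- names(read.csv("/tmp/data.csv", nrows = 1))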

Niemeyer answered 21/4, 2014 at 13:16 Comment(11)
And will this work with a DF of millions of lines as well?Mauser
Yes, sure. You only read the percentage satisfying the condition. awk processes this one line at a time.Niemeyer
This is interesting but I can't get it working from RStudio with your simple example. Something with the awk filter but not sure (no lines available in input)...Mauser
You need awk in your path, and I assumed a Unix layout here with /tmp. You need to adjust that as needed.Niemeyer
Sure, that works too. On my Linux system awk is in fact gawk.Niemeyer
Awesome @DirkEddelbuettel, could you give me the name of your book? Second, I have a similar problem but with a .txt file: the first 300 rows contain SQL errors, and after those the data is tab-separated, but when I tried to read it I only got a dataframe of one column. Could you suggest any way to solve this? Thanks.Eyestrain
I can run a modified string from the command prompt: "awk "BEGIN {FS=\",\"} {if ($1 > 20) print $0}" /tmp/data.csv" works for me after changing the quotes from ' to ", but I can't find a way to run it from R, neither with pipe nor with system.Mauser
I cut and pasted the actual command from a working session. I use outer "" for the actual string R passes to as an argument to pipe(). Inside the string, use single quotes for the argument to awk. Inside the single quote, escape quotes for the FS argument.Niemeyer
@DirkEddelbuettel This is very helpful! Where can I find more info about awk?Coverlet
Hi @DirkEddelbuettel, Thanks so much for this answer. I was not familiar with awk but this works. I accepted the answer below because it ended up being a bit faster for my particular problem (no idea why). But I wanted to thank you for always teaching us newbies something useful. I bought your rcpp book last week. ;)Certificate
Another external tool csvtk from bioinf.shenwei.me/csvtk may also be tried.Mountebank

I was looking into readr::read_csv_chunked when I saw this question and thought I would do some benchmarking. For this example, read_csv_chunked does well and increasing the chunk size was beneficial. sqldf was only marginally faster than awk.

library(tidyverse)
library(sqldf)
library(data.table)
library(microbenchmark)

# Generate an example dataset with two numeric columns and 5 million rows
tibble(
  norm = rnorm(5e6, mean = 5000, sd = 1000),
  unif = runif(5e6, min = 0, max = 10000)
) %>%
  write_csv('medium.csv')

microbenchmark(
  readr  = read_csv_chunked('medium.csv', callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)), col_types = 'dd', progress = F),
  readr2 = read_csv_chunked('medium.csv', callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)), col_types = 'dd', progress = F, chunk_size = 1000000),
  sqldf  = read.csv.sql('medium.csv', sql = 'select * from file where unif > 9000', eol = '\n'),
  awk    = read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv")),
  awk2   = read_csv(pipe("awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv"), col_types = 'dd', progress = F),
  fread  = fread(cmd = "awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv"),
  check  = function(values) all(sapply(values[-1], function(x) all.equal(values[[1]], x))),
  times  = 10L
)

# Updated 2020-05-29

# Unit: seconds
#   expr   min    lq  mean  median    uq   max neval
#  readr   2.6   2.7   3.1     3.1   3.5   4.0    10
# readr2   2.3   2.3   2.4     2.4   2.6   2.7    10
#  sqldf  14.1  14.1  14.7    14.3  15.2  16.0    10
#    awk  18.2  18.3  18.7    18.5  19.3  19.6    10
#   awk2  18.1  18.2  18.6    18.4  19.1  19.4    10
#  fread  17.9  18.0  18.2    18.1  18.2  18.8    10

# R version 3.6.2 (2019-12-12)
# macOS Mojave 10.14.6        

# data.table 1.12.8
# readr      1.3.1 
# sqldf      0.4-11
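
For the original question's filter (Variable2 >= 3), a minimal read_csv_chunked call would look something like the sketch below; the file name big_file.csv and the column name Variable2 are assumptions carried over from the question and the other answers.

library(readr)

# Read one million rows at a time; DataFrameCallback row-binds the filtered
# chunks, so only rows with Variable2 >= 3 are accumulated in memory.
filtered <- read_csv_chunked(
  "big_file.csv",
  callback   = DataFrameCallback$new(function(chunk, pos) subset(chunk, Variable2 >= 3)),
  chunk_size = 1e6
)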
Jumpy answered 31/10, 2016 at 0:44 Comment(3)
Do you mind adding data.table::fread(cmd="awk ...") for benchmarking?Nickeliferous
@Nickeliferous of course! Though, for this example awk does the heavy lifting. Is there a chunked reading capability in data.table?Jumpy
I think another external tool called csvtk from bioinf.shenwei.me/csvtk may also be used with fread for testing.Mountebank

You can read the file in chunks, process each chunk, and then stitch only the subsets together.

Here is a minimal example assuming the file has 1001 lines (incl. the header) and only 100 fit into memory at a time. The data has 3 columns, and we expect at most 150 rows to meet the condition (this is needed to pre-allocate space for the final data):

# initialize empty data.frame (150 x 3)
max.rows <- 150
final.df <- data.frame(Variable1=rep(NA, max.rows), 
                       Variable2=NA,  
                       Variable3=NA)

# read the first chunk outside the loop
temp <- read.csv('big_file.csv', nrows=100, stringsAsFactors=FALSE)
temp <- temp[temp$Variable2 >= 3, ]  ## keep only rows meeting the condition
final.df[1:nrow(temp), ] <- temp     ## add to the data
last.row <- nrow(temp)               ## keep track of how many rows are filled

for (i in 1:9){    ## nine chunks remaining to be read
  temp <- read.csv('big_file.csv', skip=i*100+1, nrows=100, header=FALSE,
                   stringsAsFactors=FALSE)
  temp <- temp[temp$V2 >= 3, ]         ## header=FALSE, so columns are V1, V2, V3
  if (nrow(temp) > 0) {                ## skip chunks with no matching rows
    final.df[(last.row+1):(last.row+nrow(temp)), ] <- temp
    last.row <- last.row + nrow(temp)  ## increment the current count
  }
}

final.df <- final.df[1:last.row, ]   ## only keep filled rows
rm(temp)    ## remove last chunk to free memory
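
If the 150 pre-allocated rows turn out to be too few (see the comments below), one way to grow the buffer in a single step rather than row by row is sketched here:

# Extend final.df by another block of 150 pre-allocated rows; this rewrites
# the data frame once instead of once per added row.
extra <- data.frame(Variable1=rep(NA, 150), Variable2=NA, Variable3=NA)
final.df <- rbind(final.df, extra)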

Edit: Added stringsAsFactors=FALSE option on @lucacerone's suggestion in the comments.

Graphology answered 21/4, 2014 at 13:6 Comment(5)
out of curiosity: say that while importing I realize that preallocating 150 rows wasn't enough, is there an efficient way to extend the rows of the final data.frame (say by another 150 rows)?Asyut
just one small note: I would use the option stringsAsFactors = FALSE when you import the data: it might be that for categorical variables you don't see all the categories in the first chunks, and R won't let you add data with categories it doesn't know about...Asyut
Good catch, thanks! I will edit the solution. To my knowledge there is no efficient way to add lines without R rewriting the whole thing, but if you do extend it (say by another 50) it will still only be one rewrite.Graphology
I think stringsAsFactors = FALSE should be added to final.df too. I am surprised though that there is no base function to let you do something similar...Asyut
I don't think it's necessary. It will be passed as a character value and will not conflict with the other chunks, also character.Graphology

You can open the file in read mode using the function file (e.g. file("mydata.csv", open = "r")).

You can read the file one line at a time using the function readLines with the option n = 1, e.g. l = readLines(fc, n = 1).

Then you have to parse the string using functions such as strsplit, regular expressions, or the package stringr (available from CRAN).

If the line meets the condition, you import it.

To summarize I would do something like this:

df = data.frame(var1=character(), var2=integer(), stringsAsFactors = FALSE)
fc = file("myfile.csv", open = "r")

i = 0
while (length(l <- readLines(fc, n = 1)) > 0){ # note the assignment to l inside the while condition

   ##parse l here: and check whether you need to import the data.

   if (need_to_add_data){
     i=i+1
     df[i,] = #list of data to import
  }

}
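
A more concrete sketch of the same loop, assuming (as in the question) a file myfile.csv with a header row, a character column in position 1, a numeric Variable2 in position 2, the condition Variable2 >= 3, and unquoted fields:

df <- data.frame(Variable1 = character(), Variable2 = numeric(),
                 stringsAsFactors = FALSE)
fc <- file("myfile.csv", open = "r")
invisible(readLines(fc, n = 1))                   # skip the header line

i <- 0
while (length(l <- readLines(fc, n = 1)) > 0) {
  fields <- strsplit(l, ",", fixed = TRUE)[[1]]   # split the csv line on commas
  if (as.numeric(fields[2]) >= 3) {               # the condition from the question
    i <- i + 1
    df[i, ] <- list(fields[1], as.numeric(fields[2]))
  }
}
close(fc)                                         # release the connection

As the comments note, growing df one row at a time like this is slow; pre-allocating rows or reading in chunks scales much better.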
Asyut answered 21/4, 2014 at 12:40 Comment(4)
This will work, but it'll be relatively slow. It's almost always fastest to edit the source file before reading into R, e.g. using a simple text editor or tools like sed and awkShocking
So he did, but sometimes ... "Let It Bleed," Track 9 .Shocking
You could make this a lot faster by reading in (say) 10,000 lines at a time.Lippold
@Lippold beat me to it. You can read it by chunks. You should also pre-allocate df or it will take very long, basically rewriting the entire data on each iteration (millions of times). I've added a proof-of-concept solution.Graphology
