How to read only lines that fulfil a condition from a csv into R?

I am trying to read a large csv file into R. I only want to read and work with some of the rows that fulfil a particular condition (e.g. Variable2 >= 3). The resulting subset is a much smaller dataset.

I want to read these lines directly into a dataframe, rather than load the whole dataset into a dataframe and then select according to the condition, since the whole dataset does not easily fit into memory.

Certificate answered 21/4, 2014 at 11:38 Comment(4)
How 'large' is your file? Can't you read 1000 lines at a time and subset them according to your condition in each loop iteration?Mauser
Yes, I tried that. I was reading 1 million lines at a time. Each iteration was taking about 15 seconds, including adding the resulting "filtered" dataset to an existing dataframe. But given the size of the dataset I am dealing with, this solution would have taken more than 1 hour. As I wrote below, the solution I actually used (sqldf) took slightly less than one hour. Dirk's suggestion of awk for my problem would have taken about 2 hours. I will look into Python to accelerate some of these tasks. If someone has good pointers, let me know. Thanks everyone.Certificate
Thanks. Nice to know it worked better with sqldf. Definitely something to retain.Mauser
This is a very common question, but the 'best' answer for each use-case depends on what sort of condition, is it simple number/string-matching on one or more fields (use grep/awk), or does it require evaluating multiple fields (e.g. V2*V3 < mean(V4) & !is.na(V5))? If a simple grep/awk gets 90+% of the coarse size reduction done, it's one good way to go.Braunstein

You could use the read.csv.sql function in the sqldf package and filter using SQL select. From the help page of read.csv.sql:

library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv", 
    sql = "select * from file where `Sepal.Length` > 5", eol = "\n")
Mountain answered 21/4, 2014 at 12:41 Comment(4)
How does this solve the "file too big for current memory" problem?Shocking
The sqlite database used under the hood is by default a temporary file, hence no memory problem, I suppose.Mountain
This was the fastest way to solve my problem in R. It took about 1 hour. Thanks!Certificate
At least in my version of R (3.4.2) running on Linux, the above example did not work without the addition of backticks around Sepal.Length, i.e. I needed to use `Sepal.Length`.Callery

By far the easiest (in my book) is to use pre-processing.

R> DF <- data.frame(n=1:26, l=LETTERS)
R> write.csv(DF, file="/tmp/data.csv", row.names=FALSE)
R> read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($1 > 20) print $0}' /tmp/data.csv"),
+           header=FALSE)
  V1 V2
1 21  U
2 22  V
3 23  W
4 24  X
5 25  Y
6 26  Z
R> 

Here we use awk. We tell awk to use a comma as the field separator, and then use the condition 'if the first field is greater than 20' to decide whether to print the whole line (via $0).

The output from that command can be read by R via pipe().

This is going to be faster and more memory-efficient than reading everything into R.
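
Because the header line is filtered out (the output above has no header row), the columns come back as V1, V2, and so on. One way to restore the original column names (a sketch reusing the same /tmp/data.csv) is to read just the top of the file separately:

# Filter with awk as above, then re-attach the original column names,
# which are read cheaply from the first line of the file.
dat <- read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($1 > 20) print $0}' /tmp/data.csv"),
                header = FALSE)
names(dat) <- names(read.csv("/tmp/data.csv", nrows = 1))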

Niemeyer answered 21/4, 2014 at 13:16 Comment(11)
And will this work with a DF of millions of lines as well?Mauser
Yes, sure. You only read the percentage satisfying the condition. awk processes this one line at a time.Niemeyer
This is interesting but I can't get it working from RStudio with your simple example. Something with the awk filter but not sure (no lines available in input)...Mauser
You need awk in your path, and I assumed a Unix layout here with /tmp. You need to adjust that as needed.Niemeyer
Sure, that works too. On my Linux system awk is in fact gawk.Niemeyer
Awesome @DirkEddelbuettel, could you give me the name of your book? Second, I have a similar problem but with a .txt file: the first 300 rows contain SQL errors, and after those the data is tab-separated, but when I tried to read it I only got a dataframe of one column. Could you suggest any way to solve this? Thanks.Eyestrain
I can run a modified string from the command prompt: "awk "BEGIN {FS=\",\"} {if ($1 > 20) print $0}" /tmp/data.csv" works for me after changing the quotes from ' to ", but I can't find a way to run it from R, neither with pipe nor with system.Mauser
I cut and pasted the actual command from a working session. I use outer "" for the actual string R passes to as an argument to pipe(). Inside the string, use single quotes for the argument to awk. Inside the single quote, escape quotes for the FS argument.Niemeyer
@DirkEddelbuettel This is very helpful! Where can I find more info about awk?Coverlet
Hi @DirkEddelbuettel, Thanks so much for this answer. I was not familiar with awk but this works. I accepted the answer below because it ended up being a bit faster for my particular problem (no idea why). But I wanted to thank you for always teaching us newbies something useful. I bought your rcpp book last week. ;)Certificate
Another external tool csvtk from bioinf.shenwei.me/csvtk may also be tried.Mountebank

I was looking into readr::read_csv_chunked when I saw this question and thought I would do some benchmarking. For this example, read_csv_chunked does well and increasing the chunk size was beneficial. sqldf was only marginally faster than awk.

library(tidyverse)
library(sqldf)
library(data.table)
library(microbenchmark)

# Generate an example dataset with two numeric columns and 5 million rows
tibble(
  norm = rnorm(5e6, mean = 5000, sd = 1000),
  unif = runif(5e6, min = 0, max = 10000)
) %>%
  write_csv('medium.csv')

microbenchmark(
  readr  = read_csv_chunked('medium.csv', callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)), col_types = 'dd', progress = F),
  readr2 = read_csv_chunked('medium.csv', callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)), col_types = 'dd', progress = F, chunk_size = 1000000),
  sqldf  = read.csv.sql('medium.csv', sql = 'select * from file where unif > 9000', eol = '\n'),
  awk    = read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv")),
  awk2   = read_csv(pipe("awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv"), col_types = 'dd', progress = F),
  fread  = fread(cmd = "awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv"),
  check  = function(values) all(sapply(values[-1], function(x) all.equal(values[[1]], x))),
  times  = 10L
)

# Updated 2020-05-29

# Unit: seconds
#   expr   min    lq  mean  median    uq   max neval
#  readr   2.6   2.7   3.1     3.1   3.5   4.0    10
# readr2   2.3   2.3   2.4     2.4   2.6   2.7    10
#  sqldf  14.1  14.1  14.7    14.3  15.2  16.0    10
#    awk  18.2  18.3  18.7    18.5  19.3  19.6    10
#   awk2  18.1  18.2  18.6    18.4  19.1  19.4    10
#  fread  17.9  18.0  18.2    18.1  18.2  18.8    10

# R version 3.6.2 (2019-12-12)
# macOS Mojave 10.14.6        

# data.table 1.12.8
# readr      1.3.1 
# sqldf      0.4-11
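
For the original question's filter (Variable2 >= 3), a minimal read_csv_chunked call would look something like the sketch below; the file name big_file.csv and the column name Variable2 are assumptions carried over from the question and the other answers.

library(readr)

# Read one million rows at a time; DataFrameCallback row-binds the filtered
# chunks, so only rows with Variable2 >= 3 are accumulated in memory.
filtered <- read_csv_chunked(
  "big_file.csv",
  callback   = DataFrameCallback$new(function(chunk, pos) subset(chunk, Variable2 >= 3)),
  chunk_size = 1e6
)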
Jumpy answered 31/10, 2016 at 0:44 Comment(3)
Do you mind adding data.table::fread(cmd="awk ...") for benchmarking?Nickeliferous
@Nickeliferous of course! Though, for this example awk does the heavy lifting. Is there a chunked reading capability in data.table?Jumpy
I think another external tool called csvtk from bioinf.shenwei.me/csvtk may also be used with fread for testing.Mountebank

You can read the file in chunks, process each chunk, and then stitch only the subsets together.

Here is a minimal example assuming the file has 1001 lines (incl. the header) and only 100 fit into memory at a time. The data has 3 columns, and we expect at most 150 rows to meet the condition (this is needed to pre-allocate space for the final data):

# initialize empty data.frame (150 x 3)
max.rows <- 150
final.df <- data.frame(Variable1=rep(NA, max.rows), 
                       Variable2=NA,  
                       Variable3=NA)

# read the first chunk outside the loop
temp <- read.csv('big_file.csv', nrows=100, stringsAsFactors=FALSE)
temp <- temp[temp$Variable2 >= 3, ]  ## keep only rows meeting the condition
final.df[1:nrow(temp), ] <- temp     ## add to the data
last.row <- nrow(temp)               ## keep track of how many rows are filled

for (i in 1:9){    ## nine chunks remaining to be read
  temp <- read.csv('big_file.csv', skip=i*100+1, nrows=100, header=FALSE,
                   stringsAsFactors=FALSE)
  temp <- temp[temp$V2 >= 3, ]         ## header=FALSE, so columns are V1, V2, V3
  if (nrow(temp) > 0) {                ## skip chunks with no matching rows
    final.df[(last.row+1):(last.row+nrow(temp)), ] <- temp
    last.row <- last.row + nrow(temp)  ## increment the current count
  }
}

final.df <- final.df[1:last.row, ]   ## only keep filled rows
rm(temp)    ## remove last chunk to free memory
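
If the 150 pre-allocated rows turn out to be too few (see the comments below), one way to grow the buffer in a single step rather than row by row is sketched here:

# Extend final.df by another block of 150 pre-allocated rows; this rewrites
# the data frame once instead of once per added row.
extra <- data.frame(Variable1=rep(NA, 150), Variable2=NA, Variable3=NA)
final.df <- rbind(final.df, extra)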

Edit: Added stringsAsFactors=FALSE option on @lucacerone's suggestion in the comments.

Graphology answered 21/4, 2014 at 13:6 Comment(5)
out of curiosity: say that while importing I realize that preallocating 150 rows wasn't enough, is there an efficient way to extend the rows of the final data.frame (say by another 150 rows)?Asyut
just one small note: I would use the option stringsAsFactors = FALSE when you import the data: it might be that for categorical variables you don't see all the categories in the first chunks, and R won't let you add data with categories it doesn't know about...Asyut
Good catch, thanks! I will edit the solution. To my knowledge there is no efficient way to add lines without R rewriting the whole thing, but if you do extend it (say by another 50) it will still only be one rewrite.Graphology
I think stringsAsFactors = FALSE should be added to final.df too. I am surprised though that there is no base function to let you do something similar...Asyut
I don't think it's necessary. It will be passed as a character value and will not conflict with the other chunks, also character.Graphology

You can open the file in read mode using the function file (e.g. file("mydata.csv", open = "r")).

You can read the file one line at a time using the function readLines with the option n = 1, e.g. l = readLines(fc, n = 1).

Then you have to parse the string using functions such as strsplit, regular expressions, or the package stringr (available from CRAN).

If the line meets the condition, you import it.

To summarize I would do something like this:

df = data.frame(var1=character(), var2=integer(), stringsAsFactors = FALSE)
fc = file("myfile.csv", open = "r")

i = 0
while (length(l <- readLines(fc, n = 1)) > 0){ # note the assignment to l inside the while condition

   ##parse l here: and check whether you need to import the data.

   if (need_to_add_data){
     i=i+1
     df[i,] = #list of data to import
  }

}
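
A more concrete sketch of the same loop, assuming (as in the question) a file myfile.csv with a header row, a character column in position 1, a numeric Variable2 in position 2, the condition Variable2 >= 3, and unquoted fields:

df <- data.frame(Variable1 = character(), Variable2 = numeric(),
                 stringsAsFactors = FALSE)
fc <- file("myfile.csv", open = "r")
invisible(readLines(fc, n = 1))                   # skip the header line

i <- 0
while (length(l <- readLines(fc, n = 1)) > 0) {
  fields <- strsplit(l, ",", fixed = TRUE)[[1]]   # split the csv line on commas
  if (as.numeric(fields[2]) >= 3) {               # the condition from the question
    i <- i + 1
    df[i, ] <- list(fields[1], as.numeric(fields[2]))
  }
}
close(fc)                                         # release the connection

As the comments note, growing df one row at a time like this is slow; pre-allocating rows or reading in chunks scales much better.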
Asyut answered 21/4, 2014 at 12:40 Comment(4)
This will work, but it'll be relatively slow. It's almost always fastest to edit the source file before reading into R, e.g. using a simple text editor or tools like sed and awkShocking
So he did, but sometimes ... "Let It Bleed," Track 9 .Shocking
You could make this a lot faster by reading in (say) 10,000 lines at a time.Lippold
@Lippold beat me to it. You can read it by chunks. You should also pre-allocate df or it will take very long, basically rewriting the entire data on each iteration (millions of times). I've added a proof-of-concept solution.Graphology
