I was looking into readr::read_csv_chunked when I saw this question and thought I would do some benchmarking. For this example, read_csv_chunked does well, and increasing the chunk size was beneficial; sqldf was only marginally faster than awk.
library(tidyverse)
library(sqldf)
library(data.table)
library(microbenchmark)

# Generate an example dataset with two numeric columns and 5 million rows
tibble(
  norm = rnorm(5e6, mean = 5000, sd = 1000),
  unif = runif(5e6, min = 0, max = 10000)
) %>%
  write_csv('medium.csv')

# Benchmark several ways of reading only the rows where unif > 9000
microbenchmark(
  # readr: filter each chunk in a callback, with the default vs. 1e6-row chunks
  readr  = read_csv_chunked('medium.csv', callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)), col_types = 'dd', progress = F),
  readr2 = read_csv_chunked('medium.csv', callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)), col_types = 'dd', progress = F, chunk_size = 1000000),
  # sqldf: filter with SQL while reading
  sqldf  = read.csv.sql('medium.csv', sql = 'select * from file where unif > 9000', eol = '\n'),
  # awk pre-filter piped into read.csv, read_csv, and fread
  awk    = read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv")),
  awk2   = read_csv(pipe("awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv"), col_types = 'dd', progress = F),
  fread  = fread(cmd = "awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv"),
  check = function(values) all(sapply(values[-1], function(x) all.equal(values[[1]], x))),
  times = 10L
)
# Updated 2020-05-29
# Unit: seconds
#    expr  min   lq mean median   uq  max neval
#   readr  2.6  2.7  3.1    3.1  3.5  4.0    10
#  readr2  2.3  2.3  2.4    2.4  2.6  2.7    10
#   sqldf 14.1 14.1 14.7   14.3 15.2 16.0    10
#     awk 18.2 18.3 18.7   18.5 19.3 19.6    10
#    awk2 18.1 18.2 18.6   18.4 19.1 19.4    10
#   fread 17.9 18.0 18.2   18.1 18.2 18.8    10
# R version 3.6.2 (2019-12-12)
# macOS Mojave 10.14.6
# data.table 1.12.8
# readr 1.3.1
# sqldf 0.4-11
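
For reference, here is the winning variant (readr2) on its own, outside microbenchmark, so the filtered result is actually kept. This is a minimal sketch using the same file and threshold as above:

library(readr)

# Read medium.csv in 1,000,000-row chunks, keeping only rows with unif > 9000;
# DataFrameCallback combines the per-chunk results into one data frame.
filtered <- read_csv_chunked(
  'medium.csv',
  callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)),
  col_types = 'dd',
  chunk_size = 1000000,
  progress = FALSE
)
nrow(filtered)  # roughly 500,000 rows, since unif is uniform on [0, 10000]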
V2*V3 < mean(V4) & !is.na(V5))? If a simple grep/awk gets 90+% of the coarse size reduction done, it's one good way to go. – Braunstein
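
To illustrate that suggestion, here is a rough sketch, not part of the benchmark above, assuming a hypothetical headerless file big.csv with five numeric columns (fread names them V1–V5) and 'NA' marking missing values: awk handles only the coarse, row-local reduction, and R applies the full filter.

library(data.table)

# Coarse pass: awk drops rows with a missing fifth column before R ever sees them.
dt <- fread(cmd = "awk 'BEGIN {FS=\",\"} {if ($5 != \"NA\" && $5 != \"\") print $0}' big.csv",
            header = FALSE)

# Fine pass: the mean-based condition needs the (reduced) data in memory.
# Note that mean(V4) is taken over the pre-filtered rows, which can differ
# slightly from the mean over the full file; !is.na(V5) is already guaranteed
# by the awk pass but is kept for fidelity to the original filter.
result <- dt[V2 * V3 < mean(V4) & !is.na(V5)]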