Reading around, I found out that the best way to read a larger-than-memory csv file is to use read.csv.sql
from package sqldf
. This function will read the data directly into a sqlite database, and consequently execute a sql statement.
I noticed the following: it seems that the data read into sqlite is stored into a temporary table, so that in order to make it accessible for future use, it needs to be asked so in the sql statement.
As a example, the following code reads some sample data into sqlite:
# generate sample data
sample_data <- data.frame(col1 = sample(letters, 100000, TRUE), col2 = rnorm(100000))
# save as csv
write.csv(sample_data, "sample_data.csv", row.names = FALSE)
# create a sample sqlite database
library(sqldf)
sqldf("attach sample_db as new")
# read the csv into the database and create a table with its content
read.csv.sql("sample_data.csv", sql = "create table data as select * from file",
dbname = "sample_db", header = T, row.names = F, sep = ",")
The data can then be accessed with sqldf("select * from data limit 5", dbname = "sample_db")
.
The problem is the following: the sqlite file takes up twice as much space as it should. My guess is that it contains the data twice: once for the temporary read, and once for the stored table. It is possible to clean up the database with sqldf("vacuum", dbname = "sample_db")
. This will reclaim the empty space, but it takes a long time, especially when the file is big.
Is there a better solution to this that doesn't create this data duplication in the first time ?