Reading huge csv files into R with sqldf works but sqlite file takes twice the space it should and needs "vacuuming"

Reading around, I found out that the best way to read a larger-than-memory csv file is to use read.csv.sql from package sqldf. This function will read the data directly into a sqlite database, and consequently execute a sql statement.

I noticed the following: it seems that the data read into sqlite is stored into a temporary table, so that in order to make it accessible for future use, it needs to be asked so in the sql statement.

As a example, the following code reads some sample data into sqlite:

# generate sample data
sample_data <- data.frame(col1 = sample(letters, 100000, TRUE), col2 = rnorm(100000))
# save as csv
write.csv(sample_data, "sample_data.csv", row.names = FALSE)
# create a sample sqlite database
library(sqldf)
sqldf("attach sample_db as new") 
# read the csv into the database and create a table with its content
read.csv.sql("sample_data.csv", sql = "create table data as select * from file", 
             dbname = "sample_db", header = T, row.names = F, sep = ",")

The data can then be accessed with sqldf("select * from data limit 5", dbname = "sample_db").

The problem is the following: the sqlite file takes up twice as much space as it should. My guess is that it contains the data twice: once for the temporary read, and once for the stored table. It is possible to clean up the database with sqldf("vacuum", dbname = "sample_db"). This will reclaim the empty space, but it takes a long time, especially when the file is big.

Is there a better solution to this that doesn't create this data duplication in the first time ?

Recommended topics

Hot tags