Reading huge csv files into R with sqldf works but sqlite file takes twice the space it should and needs "vacuuming"
Reading around, I found out that the best way to read a larger-than-memory csv file is to use read.csv.sql from the sqldf package. This function reads the data directly into an SQLite database and then executes an SQL statement on it.

I noticed the following: the data read into SQLite seems to be stored in a temporary table, so to make it accessible for future use, you have to explicitly create a persistent table from it in the SQL statement.

As an example, the following code reads some sample data into SQLite:

# generate sample data
sample_data <- data.frame(col1 = sample(letters, 100000, TRUE), col2 = rnorm(100000))
# save as csv
write.csv(sample_data, "sample_data.csv", row.names = FALSE)
# create a sample sqlite database
library(sqldf)
sqldf("attach sample_db as new") 
# read the csv into the database and create a table with its content
read.csv.sql("sample_data.csv", sql = "create table data as select * from file", 
             dbname = "sample_db", header = T, row.names = F, sep = ",")

The data can then be accessed with sqldf("select * from data limit 5", dbname = "sample_db").

The problem is the following: the SQLite file takes up twice as much space as it should. My guess is that it contains the data twice: once in the temporary table used during the read, and once in the stored table. The database can be cleaned up with sqldf("vacuum", dbname = "sample_db"). This reclaims the empty space, but it takes a long time, especially when the file is big.
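
To see the effect directly, you can compare the file size before and after vacuuming (a quick check, assuming the database file "sample_db" sits in the working directory):

# file size in bytes before vacuuming
file.info("sample_db")$size
# reclaim the space left behind by the temporary copy
sqldf("vacuum", dbname = "sample_db")
# file size after vacuuming: roughly half of the original
file.info("sample_db")$size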

Is there a better solution that avoids this data duplication in the first place?

Syllabize answered 12/9, 2012 at 15:3 Comment(1)
You likely want to use RSQLite directly for this, as suggested by a responder, but I have updated Examples 9 and 10 on the sqldf home page to show how to use sqldf persistent connections to accomplish this without duplication. See especially example 10c. Also note that sqldf accepts a vector of SQL statements and won't destroy the database until the last statement in the vector has been executed. – Otherdirected
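
Following that comment, one way to avoid the duplication within sqldf itself is a persistent connection, so the database is not torn down between calls. The lines below are only a rough sketch of the idea behind Examples 9 and 10 on the sqldf home page; in particular, passing dbname when opening the connection is an assumption here, so consult those examples for the authoritative version:

library(sqldf)
# open a persistent connection (assumption: dbname points it at the file database)
sqldf(dbname = "sample_db")
# import the csv; the table persists because the connection stays open
read.csv.sql("sample_data.csv",
             sql = "create table data as select * from file",
             header = TRUE, sep = ",")
# later queries reuse the same open connection
sqldf("select * from data limit 5")
# calling sqldf() with no arguments again closes the persistent connection
sqldf()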
Solution: using RSQLite without going through sqldf:

library(RSQLite)
con <- dbConnect(SQLite(), dbname = "sample_db")
# read the csv file straight into a database table, with no temporary copy
dbWriteTable(con, name = "sample_data", value = "sample_data.csv",
             row.names = FALSE, header = TRUE, sep = ",")
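
To verify the import and release the database file, the table can be queried through the same connection with standard DBI calls:

# preview the imported table
dbGetQuery(con, "select * from sample_data limit 5")
# close the connection when done
dbDisconnect(con)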
Syllabize answered 12/9, 2012 at 16:0 Comment(0)