Saving a flat file as an SQL database in R without loading it 100% into RAM

I hope that what I am about to write makes some sense. If you look at

How to deal with a 50GB large csv file in r language?

it is explained how to query a csv file from R, SQL-style. In my case, I have vast amounts of data stored as flat files that are as large as, or larger than, my RAM.

I would like to store, for instance, one of these as an SQLite database without loading it into memory entirely. Imagine being able to automatically read a chunk of the file small enough to fit in your RAM, write it to an SQL database, free up the memory, process the next chunk, and so on until the whole file is in the database. Is this doable in R? If the table could be handled as a tibble, it would be even better, but it is not crucial. Any suggestion is appreciated. Thanks!
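
For concreteness, here is a minimal sketch of what I have in mind, assuming the readr, DBI and RSQLite packages (the file name, table name and chunk size are illustrative): read_csv_chunked would read each chunk as a tibble and hand it to a callback that appends it to the database, so only one chunk is in RAM at a time.

library(readr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "mydb.sqlite")

# Each chunk arrives as a tibble; append it to the table and let it be
# garbage-collected before the next chunk is read.
read_csv_chunked(
  "myfile.csv",
  callback = function(chunk, pos) dbWriteTable(con, "mytable", chunk, append = TRUE),
  chunk_size = 100000
)

dbDisconnect(con)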

Grassy asked 14/7, 2020 at 16:16 Comment(3)
It's certainly doable, whether with SQLite, SQL Server, Postgres, or any other DBMS, but I don't think you need sqldf: while it does provide access to large amounts of data in a SQL way, it presumes that the data is already resident in memory. What you're talking about here is loading only a portion of the data into memory at a time, which suggests DBI and RSQLite as the packages you need. You should probably figure out how to get the 50GB of data into the sqlite file, whether through R or direct import.Correy
I mean, you already seem to describe a reasonable strategy of reading chunks of a file at a time and importing them into a database. Have you tried it? Did it not work? It's hard to offer any more specific suggestions because you haven't really given any specific details of the files/database you are using.Cockade
This might be useful: https://mcmap.net/q/143616/-import-csv-to-sqlite (importing a csv file directly without loading into R).Correy

Apparently there is already a function for that:

https://raw.githubusercontent.com/inbo/inborutils/master/R/csv_to_sqlite.R

I am testing it. I do not see any progress bar even though the corresponding option is selected, but it appears to get the job done.
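
For reference, here is roughly how I am calling it. The first three argument names are taken from the linked source; chunk_size and its default are as I read them there, so verify against the file itself.

# remotes::install_github("inbo/inborutils")
library(inborutils)

csv_to_sqlite(
  csv_file    = "myfile.csv",
  sqlite_file = "mydb.sqlite",
  table_name  = "mytable",
  chunk_size  = 50000  # rows read per pass, so memory use stays bounded
)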

Grassy answered 14/7, 2020 at 18:22 Comment(1)
Disregard my previous comment, I missed the _chunked part of that function. I still think using sqlite-native ops would be much faster, but if this doesn't take too long for you, cool.Correy

1) dbWriteTable: dbWriteTable can read the file into a database without going through R. The database is created if it does not already exist.

library(RSQLite)

cat("a,b\n1,2\n", file = "myfile.csv")  # create a small test file

con <- dbConnect(SQLite(), "mydb")  # creates mydb if it does not exist

# passing a file name as the value imports the csv directly into the table
dbWriteTable(con, "mytable", "myfile.csv")
dbGetQuery(con, "select count(*) from mytable")  # ensure it is there

dbDisconnect(con)

2) SQLite CLI: We could alternatively do it using the sqlite3 command-line shell, which can be downloaded from the SQLite download site.

https://www.sqlite.org/download.html

This does not involve R at all in creating the database. mydb will be created if it does not exist. The first line is entered at the shell or cmd prompt; it presents its own prompt, at which the remaining lines can be entered.

sqlite3 mydb
.mode csv
.import myfile.csv mytable
.quit
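
If you want to drive this from R anyway, the same dot-commands can be fed to the CLI non-interactively, for example via system2 (a sketch assuming sqlite3 is on the PATH; file and table names as in the session above):

# feed the dot-commands to sqlite3 on stdin instead of typing them
system2("sqlite3", args = "mydb",
        input = c(".mode csv", ".import myfile.csv mytable", ".quit"))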

3) Other databases: Another option is to use a database that has the ability to read csv files directly into it. H2 has csvread, MySQL has load data infile, and PostgreSQL has copy.
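
As a sketch of the PostgreSQL variant issued from R through DBI (this assumes the RPostgres package, that mytable already exists with matching columns, and that the file path is readable by the database server, since copy runs server-side):

library(DBI)

con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")

# server-side bulk load; the file contents never pass through R's memory
dbExecute(con, "COPY mytable FROM '/path/to/myfile.csv' WITH (FORMAT csv, HEADER true)")

dbDisconnect(con)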

Tegular answered 14/7, 2020 at 19:26 Comment(5)
I saw an entry in the RSQLite news file that suggested this was the case, back from 2016. I had never used it nor have I tried it, but that's a very interesting feature.Correy
I have to check this. What matters is never having the whole file in memory (regardless of R) before creating the database.Grassy
Have added other options.Tegular
But the point is not just to avoid R. The point is not to run out of RAM when handling larger-than-RAM raw datasets. Do your solutions address that?Grassy
Memory should not be a limitation with any of these solutions, but you will have to try them to be really sure, and sqlite may or may not have performance problems at this scale.Tegular