I'm an R user, and I frequently need to write functions that subset large datasets (tens of millions of rows). When I apply such functions over a large number of observations, it can get very time-consuming if I'm not careful about how I implement them.
To do this, I have sometimes used the data.table package, which is much faster than subsetting data frames. Recently, I've started experimenting with packages like RMySQL: pushing some tables to MySQL, then using the package to run SQL queries and return the results.
I have found mixed performance improvements. For smaller datasets (millions of rows), it seems that loading the data into a data.table and setting the right keys makes for faster subsetting. For larger datasets (tens to hundreds of millions of rows), it appears that sending a query to MySQL is faster.
I was wondering whether anyone has any insight into which technique should return simple subsetting or aggregation queries faster, and whether this should depend on the size of the data. I understand that setting keys in data.table is somewhat analogous to creating an index, but I don't have much more intuition beyond that.
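For concreteness, here is a minimal sketch of the keyed data.table approach described above. The table, its column names (`id`, `value`), and its size are made up for illustration and are not the real data; it only shows the shape of a keyed subset and a simple aggregation, not a benchmark.

```r
library(data.table)

set.seed(1)
n <- 1e6  # illustrative size only; the real tables are far larger

# Hypothetical table: an integer grouping column and a numeric measure.
dt <- data.table(id = sample(1e4, n, replace = TRUE),
                 value = runif(n))

# Sort the table by id and mark id as its key (loosely comparable to an index).
setkey(dt, id)

# Keyed subset: rows whose key equals 42 are located by binary search.
sub_keyed <- dt[.(42L)]

# The same rows selected with an ordinary logical expression, for comparison.
# (Recent data.table versions may optimise this form as well.)
sub_scan <- dt[id == 42L]

# Simple aggregation grouped by the key column.
agg <- dt[, .(mean_value = mean(value), n_rows = .N), by = id]
```

Wrapping each subset in `system.time()` is one way to compare the two forms on a given machine.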
`sqldf` package which does precisely what you're describing, only it creates the table in memory (I think) so the queries may run quite a bit faster. – Byler

`sqldf("...sql statement...", dbname = tempfile())` uses disk but without the `dbname=` arg it uses memory. If you are using sqldf with MySQL via RMySQL then it uses `dbname = "test"` by default. – Lithometeor
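
To illustrate the two modes mentioned in the comment above, here is a small sketch that assumes sqldf is using its default SQLite backend (that is, RMySQL is not loaded). The data frame `df` and its columns are hypothetical toy data.

```r
library(sqldf)

# Hypothetical toy data frame standing in for a much larger table.
df <- data.frame(id = rep(1:3, each = 2),
                 value = c(1, 2, 3, 4, 5, 6))

query <- "SELECT id, SUM(value) AS total FROM df GROUP BY id ORDER BY id"

# No dbname argument: the data frame is loaded into an in-memory database.
in_mem <- sqldf(query)

# dbname = tempfile(): the temporary database is created on disk instead.
on_disk <- sqldf(query, dbname = tempfile())
```

Both calls should return the same rows; only the location of the temporary database differs.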