Fastest way to subset - data.table vs. MySQL

I'm an R user, and I frequently find that I need to write functions that require subsetting large datasets (10s of millions of rows). When I apply such functions over a large number of observations, it can get very time consuming if I'm not careful about how I implement it.

To do this, I have sometimes used the data.table package, which provides much faster speeds than subsetting with data frames. Recently, I've started experimenting with packages like RMySQL, pushing some tables to MySQL and using the package to run SQL queries and return results.
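
For concreteness, the two approaches look roughly like this (just a minimal sketch; dt, big_table, id, and val are placeholder names, and the MySQL connection details are assumed):

    library(data.table)
    library(RMySQL)

    # toy data standing in for one of the large tables
    dt <- data.table(id = sample(1e6L, 1e7, replace = TRUE), val = rnorm(1e7))

    # data.table route: everything stays in R's memory
    sub_dt <- dt[id == 12345L]

    # RMySQL route: push the table to MySQL once, then send SQL for each subset
    con <- dbConnect(MySQL(), dbname = "test")
    dbWriteTable(con, "big_table", as.data.frame(dt), overwrite = TRUE)
    sub_sql <- dbGetQuery(con, "SELECT * FROM big_table WHERE id = 12345")
    dbDisconnect(con)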

I have found mixed performance improvements. For smaller datasets (millions of rows), it seems that loading the data into a data.table and setting the right keys makes for faster subsetting. For larger datasets (10s to 100s of millions of rows), it appears that sending a query out to MySQL is faster.

I was wondering whether anyone has insight into which technique should return simple subsetting or aggregation queries faster, and whether this depends on the size of the data. I understand that setting keys in data.table is somewhat analogous to creating an index, but I don't have much more intuition beyond that.
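
To make that analogy concrete, here are the two "indexing" steps side by side (continuing the sketch above, with the same placeholder names):

    # data.table: setkey() sorts the table by the key column(s); subsequent
    # subsets on the key use binary search instead of scanning every row
    setkey(dt, id)
    dt[J(12345L)]

    # MySQL: a secondary index on the same column plays the analogous role
    con <- dbConnect(MySQL(), dbname = "test")
    dbSendQuery(con, "CREATE INDEX idx_id ON big_table (id)")
    dbGetQuery(con, "SELECT * FROM big_table WHERE id = 12345")
    dbDisconnect(con)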

Oarsman answered 6/7, 2011 at 1:30 Comment(4)
I know some other folks here have more experience with this, so I'll let them expound on it with actual answers, but I suspect you'll want to look at the sqldf package, which does precisely what you're describing, only it creates the table in memory (I think), so the queries may run quite a bit faster.Byler
Thanks, joran! I'm keen to understand this in the context of large tables. It's all speculation, but I've gotten advice that the speed issues I'm seeing could be due to memory management/limitations. After all, when using data.table, aren't those tables in memory as well?Oarsman
Indeed, for data that big memory will be an issue, but I believe sqldf can use on-disk DBs as well. Again, I haven't used it much; I mentioned it because it's an entire package built around the concept of pushing data to a DB, performing SQL, and then returning it to R.Byler
If you are using sqldf with SQLite, then sqldf("...sql statement...", dbname = tempfile()) uses disk, but without the dbname= argument it uses memory. If you are using sqldf with MySQL via RMySQL, then it uses dbname = "test" by default.Lithometeor

If the data fits in RAM, data.table is faster. If you provide an example, it will probably become evident quickly that you're using data.table badly. Have you read the "do's and don'ts" on the data.table wiki?

SQL has a lower bound on speed because it is a row store. If the data fits in RAM (and 64-bit is quite a bit), then data.table is faster not just because it is in RAM but because columns are contiguous in memory (minimising page fetches from RAM to L2 cache for column operations). Use data.table correctly and it should be faster than SQL's lower bound. This is explained in FAQ 3.1. If you're seeing slower results with data.table, then chances are very high that you're using data.table incorrectly (or there's a performance bug that we need to fix). So, please post some tests, after reading the data.table wiki.
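
As an illustration of the kind of test worth posting (a sketch only; the column names are made up and timings will vary by machine):

    library(data.table)

    n  <- 1e7L
    DT <- data.table(id = sample(1e5L, n, replace = TRUE), x = rnorm(n))

    # "don't": vector scan -- compares every row's id against the value
    # (recent data.table versions may auto-index this behind the scenes)
    system.time(DT[id == 42L])

    # "do": set the key once, then subset by binary search on it
    setkey(DT, id)
    system.time(DT[J(42L)])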

Teahouse answered 6/7, 2011 at 9:27 Comment(1)
Doyle - Nice! I'm now heading to the wiki myself. I have always understood that the DB is faster for most querying, but now I can check out why, and what those bounds are. Sometimes one needs a point in the right direction . . . Thanks!Refit

I am not an R user, but I know a little about databases. I believe that MySQL (or any other reputable RDBMS) will actually perform your subsetting operations faster (usually by something like an order of magnitude), barring any additional computation involved in the subsetting process.

I suspect your performance lag on small datasets is related to the expense of the connection and the initial push of the data to MySQL. There is likely a point at which the connection overhead and data transfer time add more to the cost of your operation than MySQL is saving you.

However, for datasets larger than a certain minimum, it seems likely that this cost is compensated for by the sheer speed of the database.

My understanding is that SQL can achieve most fetching and sorting operations much, much more quickly than iterative operations in code. But one must factor in the cost of the connection and (in this case) the initial transfer of data over the network wire.
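
In R terms, one crude way to check that split might look like this (a sketch only; the table name and connection details are assumptions):

    library(RMySQL)

    # separate the per-call connection cost from the query cost itself
    t_connect <- system.time(con <- dbConnect(MySQL(), dbname = "test"))
    t_query   <- system.time(
      res <- dbGetQuery(con, "SELECT * FROM big_table WHERE id = 12345")
    )
    dbDisconnect(con)

    t_connect["elapsed"]   # roughly fixed, regardless of data size
    t_query["elapsed"]     # grows with the table (slowly, if id is indexed)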

I will be interested to hear what others have to say . . .

Refit answered 6/7, 2011 at 1:40 Comment(2)
Thanks for the post! Just a clarification: I don't push datasets to MySQL each iteration; rather, I do it once prior to running the function. So all I need to push from R to MySQL for each iteration is a value or a vector for the query to subset on.Oarsman
Hmm. I will still be interested in the reason behind the change in performance between "small" and "large" datasets. Possibly still related to connection overhead, even without the push? (e.g., connection overhead as a percentage of total execution time)Refit
