Suppose I have a connection to an external database called con
.
I would like to use dplyr
to reproduce this query
SELECT var1, var2, var3 from myTable LIMIT 10
I have tried
qry <- tbl(con, "myTable") %>%
select(var1) %>%
filter(between(row_number(), 1, 10)
but it does not give the desired result and the query it produces is much slower than the one I am after.
The query it produces is
SELECT "var1",
FROM SELECT "var1", row_number() OVER () AS "zzz25"
FROM SELECT "var1" AS "var1"
FROM myTable "yhydrqlhho") "zsdfunxidf"
WHERE zzz25 BETWEEN 1.0 AND 10.0)
When I send this query to the database it runs for a very long time. When I send
SELECT var1 from myTable limit 10
the result comes back instantaneously.
ORDER BY
clause. SayingLIMIT 10
without an ordering is meaningless. – Gorkidplyr
(well, reallydbplyr
) will often generate sub-optimal SQL code b/c it has no full query analysis engine or other real SQL & back-end DB heuristics to do that. It's main goal is to make working with databases just like using normaldplyr
for internal R data frames. You can actually dotbl(con, "SQL STATEMENT")
to use an optimized query that returns a tabledplyr
can use. – Bunnifilter
andselect
statements in your dplyr code lead to a speedup? – Incalescent