Create the SQL query "SELECT * FROM myTable LIMIT 10" using dplyr

Suppose I have a connection to an external database called con.

I would like to use dplyr to reproduce this query

SELECT var1, var2, var3 from myTable LIMIT 10

I have tried

qry <- tbl(con, "myTable") %>%
    select(var1) %>%
    filter(between(row_number(), 1, 10))

but it does not give the desired result and the query it produces is much slower than the one I am after.

The query it produces is

SELECT "var1",
FROM SELECT "var1", row_number() OVER () AS "zzz25"
FROM SELECT "var1" AS "var1"
FROM myTable "yhydrqlhho") "zsdfunxidf"
WHERE zzz25 BETWEEN 1.0 AND 10.0)

When I send this query to the database it runs for a very long time. When I send

SELECT var1 from myTable limit 10 

the result comes back instantaneously.
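
For reference, a minimal sketch of sending the hand-written query directly over the same connection with DBI (assuming con is a DBI connection):

library(DBI)

# run the hand-written LIMIT query as-is and pull the result into a data frame
top10 <- dbGetQuery(con, "SELECT var1 FROM myTable LIMIT 10")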

Motherinlaw answered 27/10, 2017 at 2:29 Comment(11)
Why would you want to use dplyr instead of something like RODBC before editing the data using dplyr?Woosley
Your SQL query has a problem, because there is no ORDER BY clause. Saying LIMIT 10 without an ordering is meaningless.Gorki
dplyr (well, really dbplyr) will often generate sub-optimal SQL because it has no full query analysis engine or other real SQL & back-end DB heuristics to do that. Its main goal is to make working with databases feel just like using normal dplyr on in-memory R data frames. You can actually do tbl(con, "SQL STATEMENT") to use a hand-optimized query that returns a table dplyr can work with (see the sketch after these comments).Bunni
Out of curiosity, does swapping the filter and select statements in your dplyr code lead to a speedup?Incalescent
@Bunni Interesting. Thanks for the tip. 98% of the time the SQL that dplyr creates works great but I'm having trouble getting the top 10 rows of a huge table efficiently with dplyr. It seems like this would be easy.Motherinlaw
@ScottRitchie Swapping filter and select produces basically the same query.Motherinlaw
@TimBiegeleisen I guess I never realized I needed ordering since the query always gave me a result I was happy with, namely 10 rows of my table.Motherinlaw
@Woosley I much prefer querying large tables in a database using dplyr rather than writing SQL. I just don't have a lot of experience writing SQL and find I can write my queries much more succinctly using dplyr.Motherinlaw
@AdamBlack yes - I'm wondering whether filter %>% select will get first 10 rows, then select appropriate columns, while select %>% filter presumably gets all rows in the DB for selected columns (i.e. very slow) then filters to first 10 rows.Incalescent
@MichaelChirico I edited the question to show what dplyr produces. I'm not sure why this is so much slower than using the LIMIT keyword but the execution time difference is substantial.Motherinlaw
@ScottRitchie This is the SQL generated when I do filter then select: SELECT "var1" FROM (SELECT * FROM (SELECT *, row_number() OVER () AS "zzz25" FROM myTable "yhydrqlhho") "zsdfunxidf" WHERE zzz25 BETWEEN 1.0 AND 10.0)Motherinlaw
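
A minimal sketch of the tbl(con, "SQL STATEMENT") tip from the comments, assuming con and myTable from the question; wrapping the string in sql() marks it as literal SQL for dbplyr:

library(dplyr)

# a hand-written query becomes a lazy table that further dplyr verbs can build on
qry <- tbl(con, dbplyr::sql("SELECT var1, var2, var3 FROM myTable LIMIT 10"))
qry %>% collect()   # executes the query and brings the 10 rows into R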

You can try head(10); it generates the correct SQL query on Postgres:

tbl(con, 'my_table') %>% select(var1, var2) %>% head(6) %>% explain()
# here con is a PostgreSQL connection

#<SQL>
#SELECT "var1" AS "var1", "var2" AS "var2"
#FROM "my_table"
#LIMIT 6
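
To actually pull those rows into R, rather than just print the translated SQL, the same pipeline can end in collect(); a sketch assuming the connection and table above:

tbl(con, 'my_table') %>%
    select(var1, var2) %>%
    head(10) %>%   # translated to LIMIT 10
    collect()      # executes the query and returns a tibble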
Overcasting answered 27/10, 2017 at 3:5 Comment(0)

If you're after the actual data from your query, rather than just recreating the SQL query, then specifying collect(n = 10) will fetch the same first ten rows as the query in @Overcasting's answer.

tbl(con, 'my_table') %>% select(var1, var2) %>% collect(n=10)

# A tibble: 10 x 2
   var1    var2
   <chr>  <dbl>
 1 text1   87.8
 2 text2   99.6
 3 text3  100
 4 text4   91.9
 5 text5   76.8
 6 text6   77.8
 7 text7   77.2
 8 text8   97.2
 9 text9   97.5
10 text10  80.4

Note that the default in collect() is n = 1e+05, so if your data (after filtering) contains more rows, you'll need to specify collect(n = Inf) to retrieve all of it.
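
For example, a sketch of retrieving everything after a hypothetical filter, assuming the same connection and table as above:

tbl(con, 'my_table') %>%
    filter(var2 > 90) %>%   # hypothetical filter for illustration
    collect(n = Inf)        # retrieve all matching rows, not just the first 1e5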

Invocation answered 6/7, 2018 at 2:5 Comment(0)
