Is there a way to turn off DESCRIBE in R dplyr sql
Asked Answered
M

2

9

I'm using R shiny and dplyr to connect to a database and query the data in Impala. I do the following.

con <- dbPool(odbc(),
Driver = [DIVER],
Host = [HOST],
Schema = [SCHEMA],
Port = [PORT],
UID = [USERNAME],
PWD = [PASSWORD])

table_foo <- tbl(con, [TABLE_NAME_FOO])
table_bar <- tbl(con, [TABLE_NAME_BAR])

When I run a query for instance:

table %>% filter(name=greg) %>% collect()

There are describes that are being run on the impala:

DESCRIBE TABLE 'table_foo'
DESCRIBE TABLE 'table_bar'

All the describes run before every query.(Every collect()) I have many tables all these describes are wasting substantial amount of time. Especially on impala where some describes can take a while to run.

Is there a way to turn these off? Is there anything I can do? I looked at the docs and could not find anything: https://db.rstudio.com/dplyr/

Muscle answered 12/8, 2019 at 19:3 Comment(3)
R/dplyr uses describe table to get metadata about resultset so it is a substantial part of querying tables.Vincentia
I mean the problem here is that it's a little ridiculous that: 1) It literally runs the describes for tables it's not going to query. 2) Even runs that describes when I do dbGetQuery instead of using dplyr. I don't know of any other orm that does this as a default config?Muscle
Some more information would assist. Please can you add to your question the results from table %>% filter(name=greg) %>% explain()? Thanks.Neurophysiology
N
1

My hypothesis is that this is a result of using the pool package and the way it manages multiple connections to a database. However your question does not give enough details for me to be certain, and I can not recreate your situation in my local environment.

If the problem is a result of the pool package, my suggestion would be to switch to using the dbplyr package directly. This provides a translation of dplyr commands into the database language.

# packages
library(DBI)
library(dplyr)
library(dbplyr)

# Connection string (see note below)
connection_string = "DRIVER = ...; DATABASE = ...; HOST = ..."
# connection
con = dbConnect(odbc::odbc(), .connection_string = connection_string)

# connect to two tables
table_foo <- tbl(con, "TABLE_NAME_FOO")
table_bar <- tbl(con, "TABLE_NAME_BAR")

# review queries
table_foo %>% explain()
table_foo %>% filter(name=greg) %>% explain()
# fetch data
table_foo %>% filter(name=greg) %>% collect()

This link appears to provide instruction as to how to create an Impala connection string. But given your setup, you might have an equivalent way.

If you try the above, and still observe DESCRIBE TABLE being run, then it could be that describing tables is required for returning results as @Lukasz suggests - and the nature of an Impala connection is that it can not tell which tables are needed.

You also mention in your opening line that you are using R Shiny. The dynamic nature of shiny apps means it is not always intuitive which parts of the code run. So if you have not tested the above in isolation, it is possible that DESCRIBE TABLE is being called by a different part of your shiny app that you did not realize was being run.

Neurophysiology answered 21/8, 2019 at 5:17 Comment(0)
C
1

Maybe something like this ? I can't reproduce your issue so sorry if it doesn't help :

lazy_tbl <- table %>% filter(name=greg)
sql_query <- dbplyr::sql_render(lazy_tbl) %>% gsub("\\n"," ", .)
# print sql_query and see if there are DESCRIBE statements to get out
output <- DBI::dbGetQuery(con, sql_query)

Then once you get it right wrap it up in a custom collect function :

collect2 <- function(x){
  dbplyr::sql_render(x) %>% 
    gsub("\\n"," ", .) %>%
    # other steps ?
    DBI::dbGetQuery(x$src$con, .)
}
Canonist answered 22/8, 2019 at 9:17 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.