R: Best Practices - dplyr and odbc multi table actions (retrieved from SQL)
Say your tables are stored in a SQL Server database, and you want to perform multi-table actions, i.e. join several tables from that same database.

The following code can connect to and retrieve data from SQL Server:

library(dplyr)
library(odbc)
con <- dbConnect(odbc::odbc(),
                 .connection_string = "Driver={SQL Server};Server=.;Database=My_DB;")

Table1 <- tbl(con, "Table1")
Table1 # View glimpse of Table1

Table2 <- tbl(con, "Table2")
Table2 # View glimpse of Table2

Table3 <- tbl(con, "Table3")

However, after a few results have been retrieved over the same connection, the following error eventually occurs:

Error: [Microsoft][ODBC SQL Server Driver]Connection is busy with results for another hstmt

My current googling skills have taken me to the answer that the backend does not support multiple active result sets (MARS) - I guess a single active result set per connection is the maximum? (the backend is DBI and odbc)

So, my question is: what is best practice if I want to collect data from several tables from an SQL DB?

  • Open a connection for each table?

  • Actively close the connection and open it again for the next table?

  • Does the backend support MARS, to be passed in the connection string?
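As an aside, when the goal is a server-side join written in dplyr syntax, dbplyr will translate a join of two lazy table references into a single SQL statement, so only one result set is ever requested. A minimal sketch, using dbplyr's simulated SQL Server backend so no real connection is needed (the `id`, `x`, and `y` columns are made up for illustration):

```r
library(dplyr)
library(dbplyr)

# Simulated lazy tables; with a live connection you would use
# tbl(con, "Table1") and tbl(con, "Table2") instead.
t1 <- lazy_frame(id = integer(), x = double(), con = simulate_mssql())
t2 <- lazy_frame(id = integer(), y = double(), con = simulate_mssql())

# The join is not executed in R; dbplyr builds one SQL query for it.
joined <- t1 %>% inner_join(t2, by = "id")
show_query(joined)  # prints the single translated SQL statement
```

With a live connection, a final `collect()` would run that one query and pull only the joined result into R.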

Harrow answered 24/1, 2018 at 7:42 Comment(12)
Write a SQL query to join several tables from that same database and send the SQL query directly to the DB? – Rachellrachelle

Don't plan on doing database operations on your tables within R. Do the work in the database. – Mamie

I think you need to add a %>% collect() after each tbl() call. Or use dbReadTable(). – Cynthiacynthie

I believe he's not using collect because he wants to do these operations on the server side, which is actually in line with @Tim's comment. – Pepito

However @Mathias, if you choose to close the connection, you will need to collect first. And I believe (I'm not sure) that joining tables through different connections is possible, but I think it implies some costly copies in the background. How big is the data? – Pepito

I want to do the operations on the server side, as @Moody_Mudskipper says, but I would like to do them with dplyr's syntax, thus I need connections to several tables. The data is big (several billion rows, hence why I want to do the manipulation on the server side - after which I want to collect it and do the analysis in R). I might be able to add a MARS attribute to the connection string; I'll test it tomorrow and get back if it works. Ideas are still welcome. – Harrow

I don't work with billions of rows, but I usually manage to filter/join just enough on the server side so I can do most of the work in R. dbplyr is amazing, but it's often misleading and unpredictable. In your case you could type the SQL query to do your join/filter and use DBI::dbSendQuery or DBI::dbGetQuery. – Pepito

Also try DRIVER={SQL Server Native Client 11.0}. – Pepito

Not only a change of driver: adding the attribute MultipleActiveResultSets seems to do the job. I've gotten some good results with the connection specified in the OP edit. – Harrow

@Moody_Mudskipper: Also, FYI, see the new pool package. It might fix the whole connection-management issue. Although, the drawback is that the database doesn't pop up in the Connections tab for reference (yet). – Harrow

Great! Please consider moving your edit to a proper answer, so the question can be marked as answered and will be easier for other users to find and browse in the future. – Pepito

Of course, done! – Harrow
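The raw-SQL route suggested in the comments can be sketched as follows; a hand-written query does the join server-side and returns a single result set, so MARS is never needed (the table and column names here are hypothetical):

```r
library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(),
                 .connection_string = "Driver={SQL Server};Server=.;Database=My_DB;")

# One query, one result set: the join and filter run on the server,
# and only the final rows travel back to R.
result <- dbGetQuery(con, "
  SELECT t1.*, t2.some_column
  FROM Table1 AS t1
  INNER JOIN Table2 AS t2
    ON t1.id = t2.id
  WHERE t1.some_flag = 1
")

dbDisconnect(con)
```

This trades dplyr syntax for full control over the SQL, which can be the pragmatic choice when dbplyr's translation is hard to predict.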
To make a connection that can hold multiple result sets, I've had luck with the following connection code:

con <- DBI::dbConnect(odbc::odbc(),
                      Driver = "SQL Server Native Client 11.0",
                      Server = "my_host",
                      UID = rstudioapi::askForPassword("Database UID"),
                      PWD = rstudioapi::askForPassword("Database PWD"),
                      Port = 1433,
                      MultipleActiveResultSets = "True",
                      Database = my_db)
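With MultipleActiveResultSets enabled, several lazy table references can be open on that one connection at the same time and joined with dplyr verbs. A short sketch, assuming the tables share an `id` column (names are illustrative):

```r
library(dplyr)

# Multiple open table references on the same MARS-enabled connection.
Table1 <- tbl(con, "Table1")
Table2 <- tbl(con, "Table2")

joined <- Table1 %>%
  left_join(Table2, by = "id")   # translated to SQL, executed server-side

collect(joined)  # materialise in R only after the server has done the work
```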

On top of that, I found that the new pool package can do the job:

pool <- dbPool(odbc::odbc(),
               Driver = "SQL Server Native Client 11.0",
               Server = "my_host",
               UID = rstudioapi::askForPassword("Database UID"),
               PWD = rstudioapi::askForPassword("Database PWD"),
               Port = 1433,
               MultipleActiveResultSets = "True",
               Database = my_db)

It is quicker and more stable than the DBI connection. One minor drawback, however, is that the database doesn't pop up in the Connections tab for easy reference.
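A pool object can be used in most places a DBI connection is expected, including tbl(), so the dplyr workflow stays the same. A brief sketch (table and column names are made up):

```r
library(dplyr)

# The pool hands out a connection per query and returns it afterwards.
Table1 <- tbl(pool, "Table1")

Table1 %>%
  filter(year == 2018) %>%   # filter runs server-side
  collect()
```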

For both methods, remember to close the connection/pool when done. For the DBI method it's:

dbDisconnect(con)

whereas the pool method is closed by calling:

poolClose(pool)
Harrow answered 25/1, 2018 at 11:41 Comment(0)
