Optimize/Vectorize Database Query with R

I am attempting to use R to query a large database. Because of the size of the database, I have written the query to fetch 100 rows at a time. My code looks something like this:

library(RJDBC)
library(DBI)
library(tidyverse)

options(java.parameters = "-Xmx8000m")

drv<-JDBC("driver name", "driver path.jar")

conn<-
  dbConnect(
    drv, 
    "database info",
    "username",
    "password"
)

query<-"SELECT * FROM some_table"

hc <- tibble()
res <- dbSendQuery(conn, query)
repeat {
  chunk <- dbFetch(res, 100)       # fetch the next 100 rows of the result set
  if (nrow(chunk) == 0) { break }  # stop once the result set is exhausted
  hc <- bind_rows(hc, chunk)       # append the chunk to the accumulated tibble
  print(nrow(hc))
}

Basically, I would like to write something that does the same thing, but as a combination of a function and lapply. In theory, given the way R handles explicit loops, using lapply should speed up the query. Some understanding of the dbFetch function may help here. Specifically, how does the repeat loop avoid just re-selecting the same initial 100 rows on every iteration?

I have tried the following, but nothing works:

df_list <- lapply(query , function(x) dbGetQuery(conn, x)) 

hc<-tibble()
res<-dbSendQuery(conn,query)
test_query<-function(x){
  chunk<-dbFetch(res,100)
  if(nrow(chunk)==0){break}
  print(nrow(hc))
}
bind_rows(lapply(test_query,res))
Arterio answered 3/12, 2019 at 17:55 Comment(0)

The following works well, as it allows the user to customize the size and number of chunks. Ideally, the function would be vectorized somehow.

I explored getting the number of rows to automatically set the chunk number, but I couldn't find a way to do it without actually running the query first. Adding a large number of chunks doesn't add much extra processing time. The performance improvement over the repeat approach depends on the size of the data: the bigger the data, the bigger the improvement.
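
(For reference, a separate COUNT(*) query would give N up front, at the cost of one extra round trip. A rough sketch, reusing the conn and some_table placeholders from the solution below; the count column is extracted positionally since its name varies by DBMS:)

## Rough sketch: one extra COUNT(*) query gives N, so chunk_number can be derived
## rather than hard-coded (conn and some_table are the same placeholders as below)
n_rows <- dbGetQuery(conn, "SELECT COUNT(*) FROM some_table")[[1]]
chunk_size <- 1000
chunk_number <- ceiling(n_rows / chunk_size)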

Chunks of n = 1000 seem to consistently produce the best results. Any suggestions on these points would be much appreciated.

Solution:

library(RJDBC)
library(DBI)
library(dplyr)
library(tidyr)

res <- dbSendQuery(conn, "SELECT * FROM some_table")

## chunk_size * chunk_number must be greater than the total number of rows N
chunk_size   <- 1000
chunk_number <- 150

run_chunks <-
  function(i, res, chunk_size) {
    ## i is the chunk index supplied by lapply; fetch the next chunk_size rows,
    ## returning NULL (ignored by bind_rows) if the fetch fails
    chunk <-
      tryCatch(
        dbFetch(res, chunk_size),
        error = function(e) NULL
      )

    if (!is.null(chunk)) {
      return(chunk)
    }
  }

dat <-
  bind_rows(
    lapply(
      1:chunk_number,
      run_chunks,
      res,
      chunk_size
    )
  )
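
Once all the chunks have been fetched, the result set and connection can be released (standard DBI cleanup, independent of the chunking approach):

dbClearResult(res)
dbDisconnect(conn)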
Arterio answered 6/12, 2019 at 21:44 Comment(2)
I'm playing with this now, but I am eager to think about ways to parallelize and accelerate this! A query that runs in a few minutes in MS SQL Server directly on the server that hosts the DB takes over an hour to return the full data frame in R, and I want to find a better workflow! – Capone
@michael that’s the whole point of the question lol. Feel free to post if you find a better solution! – Arterio
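
Following up on the parallelization question in the comments: a single result set has to be fetched sequentially, but the query itself can be partitioned by a key column and run over several connections at once. A rough sketch with a PSOCK cluster; the id column, the ranges, and the connection details are placeholders, and whether this actually helps depends on the database and network:

library(parallel)

## Hypothetical sketch: partition the query by an integer key column ("id" is a
## placeholder) and fetch each partition on its own connection in parallel
cl <- makeCluster(4)

ranges <- list(c(1L, 250000L), c(250001L, 500000L),
               c(500001L, 750000L), c(750001L, 1000000L))

dat_list <- parLapply(cl, ranges, function(r) {
  library(RJDBC)
  drv  <- JDBC("driver name", "driver path.jar")
  conn <- dbConnect(drv, "database info", "username", "password")
  on.exit(dbDisconnect(conn))
  dbGetQuery(conn, sprintf(
    "SELECT * FROM some_table WHERE id BETWEEN %d AND %d", r[1], r[2]))
})

stopCluster(cl)
dat <- dplyr::bind_rows(dat_list)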

Consider following the example in the dbFetch docs that checks whether the fetch has completed with dbHasCompleted. Then, for memory efficiency, build a list of data frames/tibbles with lapply and row-bind them once outside the loop.

rs <- dbSendQuery(con, "SELECT * FROM some_table")

run_chunks <- function(i, res) {
  # base::transform OR dplyr::mutate
  # base::tryCatch => for empty chunks depending on chunk number
  chunk <- tryCatch(transform(dbFetch(res, 100), chunk_no = i),    
                    error = function(e) NULL)
  return(chunk)
}

while (!dbHasCompleted(rs)) {
  # PROVIDE SUFFICIENT NUMBER OF CHUNKS (table rows / fetch rows)
  df_list <- lapply(1:5, run_chunks, res=rs)                      
}

# base::do.call(rbind, ...) OR dplyr::bind_rows(...) 
final_df <- do.call(rbind, df_list)

Demonstration with an in-memory SQLite database of mtcars:

con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(con, "mtcars", mtcars)

run_chunks <- function(i, res) {
  chunk <- dbFetch(res, 10)
  return(chunk)
}

rs <- dbSendQuery(con, "SELECT * FROM mtcars")

while (!dbHasCompleted(rs)) {
  # PROVIDE SUFFICIENT NUMBER OF CHUNKS (table rows / fetch rows)

  df_list <- lapply(1:5, function(i) 
    print(run_chunks(i, res=rs))
  )
}

do.call(rbind, df_list)

dbClearResult(rs)
dbDisconnect(con)

Output (five chunks of 10, 10, 10, 2, and 0 rows, followed by the full 32-row data frame):

#     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
# 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
# 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
# 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
# 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

#     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 1  17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
# 2  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
# 3  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
# 4  15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
# 5  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
# 6  10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
# 7  14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
# 8  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# 9  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
# 10 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

#     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 1  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
# 2  15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
# 3  15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
# 4  13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
# 5  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
# 6  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
# 7  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
# 8  30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
# 9  15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
# 10 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

#    mpg cyl disp  hp drat   wt qsec vs am gear carb
# 1 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
# 2 21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

# [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
# <0 rows> (or 0-length row.names)

do.call(rbind, df_list)
#     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
# 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
# 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
# 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
# 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
# 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
# 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
# 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
# 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
# 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
# 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
# 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
# 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
# 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
# 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
# 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
# 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
# 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
# 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
# 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
# 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
# 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
# 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
# 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
# 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
# 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
# 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
# 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Chromatolysis answered 3/12, 2019 at 19:0 Comment(2)
the while loop doesn't work quite right, as dbHasCompleted appears to be TRUE as soon as the connection is established (not when the query is complete). Also, the data sets I deal with range quite a bit in size, so I wanted more flexibility as I queried. Moreover, there is no reason to add chunk_no as a column, as it adds processing time. tryCatch was a good idea, though; I was able to build an effective solution using it (see my answer above). – Arterio
Very interesting, as I tested this code before posting! See the extended demo, which prints out every chunk and the final data frame. – Chromatolysis