How to query number of Ids in batches in R
Asked Answered
T

4

5

I have below mentioned dataframe in R.

ID       Amount     Date
IK-1     100        2020-01-01
IK-2     110        2020-01-02
IK-3     120        2020-01-03
IK-4     109        2020-01-03
IK-5     104        2020-01-03

I'm using ID to fetch some details from MySQL using the following code.

library(RMySQL)

conn<- connection

query<-paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
           where c.ID IN (", paste(shQuote(dataframe$ID, type = "sh"),
                                      collapse = ', '),") 
and e.Parameters in
           ('Section1',
           'Section2','Section3',
           'Section4');")

res1 <- dbGetQuery(conn,query)

res2<-res1[res1$Parameters=="Section1",4:5]
colnames(res2)[colnames(res2)=="status"] <- "Section1_Status"

The above code is working fine, If i pass ~1000 ID but it throws R termination error when passing 10000 or more ID at a time.

How can I create a loop and pass the Id in batched to get the one final output for 10000 ID.

Error Message:

Warning message:
In dbFetch(rs, n = n, ...) : error while fetching rows
Terrazas answered 3/2, 2020 at 8:22 Comment(6)
Just guessing, can we not join the r dataframe, too? Replace your "where ... " with JOIN dataframe[, "ID", drop = FALSE] x ON x.ID = e.role_id ?Clobber
@zx8754: I tried, It didn't work.Terrazas
@Vector JX could you please include the error msg.Allix
@A.Suliman: After running for around 10min the R got terminated as session expire, without showing any error message. The code is working fine when passing around 1000 ID.Terrazas
@A.Suliman: sometime getting error message like Warning message: In dbFetch(rs, n = n, ...) : error while fetching rowsTerrazas
@VectorJX I hope this MySQL IN condition limit can help. also see this one #9875188Allix
P
2
# Load Packages
library(dplyr) # only needed to create the initial dataframe
library(RMySQL)

# create the initial dataframe
df <- tribble(
    ~ID,       ~Amount,     ~Date
    , "IK-1"    , 100       , 2020-01-01
    , "IK-2"    , 110       , 2020-01-02
    , "IK-3"    , 120       , 2020-01-03
    , "IK-4"    , 109       , 2020-01-03
    , "IK-5"    , 104       , 2020-01-03
)

# first helper function
createIDBatchVector <- function(x, batchSize){
    paste0(
        "'"
        , sapply(
            split(x, ceiling(seq_along(x) / batchSize))
            , paste
            , collapse = "','"
        )
        , "'"
    )
}

# second helper function
createQueries <- function(IDbatches){
    paste0("
SELECT c.ID,e.Parameters, d.status
FROM Table1 c
    LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
    LEFT OUTER JOIN Table3 e ON e.role_id = d.role
WHERE c.ID IN (", IDbatches,") 
AND e.Parameters in ('Section1','Section2','Section3','Section4');
")
}

# ------------------------------------------------------------------

# and now the actual script

# first we create a vector that contains one batch per element
IDbatches <- createIDBatchVector(df$ID, 2)

# It looks like this:
# [1] "'IK-1','IK-2'" "'IK-3','IK-4'" "'IK-5'" 

# now we create a vector of SQL-queries out of that
queries <- createQueries(IDbatches)

cat(queries) # use cat to show what they look like

# it looks like this:

# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-1','IK-2') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#  
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-3','IK-4') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#  
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-5') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');

# and now the loop
df_final <- data.frame() # initialize a dataframe

conn <- connection # open a connection
for (query in queries){ # iterate over the queries
    df_final <- rbind(df_final, dbGetQuery(conn,query))
}

# And here the connection should be closed. (I don't know the function call for this.)
Petropavlovsk answered 12/2, 2020 at 16:16 Comment(0)
M
5

Pass the ID's data frame into a temp table before your SQL query and then use that to inner join on the ID's that you are using, this way you can avoid looping. All you gotta do is use dbWriteTable and set the parameter temporary = TRUE when calling it.

EX:

library(DBI)
library(RMySQL)
con <- dbConnect(RMySQL::MySQL(), user='user', 
password='password', dbname='database_name', host='host')
#here we write the table into the DB and then declare it as temporary
dbWriteTable(conn = con, value = dataframe, name = "id_frame", temporary = T)
res1 <- dbGetQuery(con = conn, "SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
Inner join id_frame idf on idf.ID = c.ID 
and e.Parameters in
       ('Section1',
       'Section2','Section3',
       'Section4');")

This should improve performance on your code as well as you won't need to loop in R anymore with the where statement. Let me know if it isn't working properly.

Myron answered 7/2, 2020 at 16:35 Comment(3)
I'm getting error could not run statement: CREATE command denied to userTerrazas
Is there any way for your System admin to grant you create access to temporary tables? You probably have read-only access to the DB. There is a difference in permission to CREATE and CREATE TEMPORARY TABLE. I just want to verify you have temp table permissions.Myron
No, I don't have write access.Terrazas
P
2
# Load Packages
library(dplyr) # only needed to create the initial dataframe
library(RMySQL)

# create the initial dataframe
df <- tribble(
    ~ID,       ~Amount,     ~Date
    , "IK-1"    , 100       , 2020-01-01
    , "IK-2"    , 110       , 2020-01-02
    , "IK-3"    , 120       , 2020-01-03
    , "IK-4"    , 109       , 2020-01-03
    , "IK-5"    , 104       , 2020-01-03
)

# first helper function
createIDBatchVector <- function(x, batchSize){
    paste0(
        "'"
        , sapply(
            split(x, ceiling(seq_along(x) / batchSize))
            , paste
            , collapse = "','"
        )
        , "'"
    )
}

# second helper function
createQueries <- function(IDbatches){
    paste0("
SELECT c.ID,e.Parameters, d.status
FROM Table1 c
    LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
    LEFT OUTER JOIN Table3 e ON e.role_id = d.role
WHERE c.ID IN (", IDbatches,") 
AND e.Parameters in ('Section1','Section2','Section3','Section4');
")
}

# ------------------------------------------------------------------

# and now the actual script

# first we create a vector that contains one batch per element
IDbatches <- createIDBatchVector(df$ID, 2)

# It looks like this:
# [1] "'IK-1','IK-2'" "'IK-3','IK-4'" "'IK-5'" 

# now we create a vector of SQL-queries out of that
queries <- createQueries(IDbatches)

cat(queries) # use cat to show what they look like

# it looks like this:

# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-1','IK-2') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#  
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-3','IK-4') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#  
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
#     LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
#     LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-5') 
# AND e.Parameters in ('Section1','Section2','Section3','Section4');

# and now the loop
df_final <- data.frame() # initialize a dataframe

conn <- connection # open a connection
for (query in queries){ # iterate over the queries
    df_final <- rbind(df_final, dbGetQuery(conn,query))
}

# And here the connection should be closed. (I don't know the function call for this.)
Petropavlovsk answered 12/2, 2020 at 16:16 Comment(0)
B
0

As the links by @A.Suliman suggest this is most likely due to having a large number of values in you IN-clause. Here are some solutions to try:

Batch processing

I am a fan of using modulo to batch process. This assumes the ID values you are batching across are numeric:

num_batches = 100
output_list = list()

for(i in 1:num_batches){
    this_subset = filter(dataframe, ID %% num_batches == (i-1))

    # subsequent processing using this_subset

    output_list[i] = results_from_subsetting
}
output = data.table::rbindlist(output_list)

In your case it looks like the ID takes the form XX-123 (two characters, a hyphen, followed by some numbers). You can convert this into a number using: just_number_part = substr(ID, 4, nchar(ID)).

Temporary file writing

If you were to write dataframe from R to sql then you would not need such a large IN-clause and could use a join instead. The dbplyr package includes a function copy_to that can be used to write temporary tables to the database.

This would look something like:

library(RMySQL)
library(dbplyr)

conn<- connection

copy_to(conn, dataframe, name = "my_table_name") # copy local table to mysql

query<-paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
INNER JOIN my_table_name a ON a.ID = c.ID # replace IN-clause with inner join
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
WHERE e.Parameters in
           ('Section1',
           'Section2','Section3',
           'Section4');")

res1 <- dbGetQuery(conn,query)

For reference I recommend the tidyverse documentation. You might also find this question on writing using copy_to helpful for debugging.

Increasing the timeout delay

When there are a lot of values in an IN-clause the query becomes much slower to execute as an IN-clause is essentially translated to a sequence of OR statements.

According to this link you can change the timeout options for MySQL by:

  • Edit your my.cnf (MySQL config file)
  • Add the timeout configuration and adjust it to fit your server.
    • wait_timeout = 28800
    • interactive_timeout = 28800
  • Restart MySQL
Bono answered 6/2, 2020 at 20:54 Comment(2)
Getting error could not run statement: CREATE command denied to user.Terrazas
I assume this is when you attempt copy_to. It appears that you do not have permission to create new tables in the MySQL database. So the second approach I suggested will not work in your configurationBono
S
0

Maybe just a try...

Following an above comment, there might be a size limit in the MySQL IN (...) condition. Maybe you could bypass it by splitting the entire list of dataframe$IDs in sublist and re-writing your query with a condition like :

WHERE c.ID IN sublist#1
OR c.ID IN sublist#2
OR c.ID IN sublist#3
...

instead of an unique c.ID IN list ?

Let's say that we make sublists with a max length of 1000, it could give :

sublists <- split(dataframe$ID, ceiling(seq_along(dataframe$ID)/1000))

and then, you could build a string like "OR c.ID IN (...) OR c.ID IN (...) OR c.ID IN (...) ...

Inserted in your code, that would give :

library(RMySQL)
conn<- connection
sublists <- split(dataframe$ID, ceiling(seq_along(dataframe$ID)/1000))

query <- paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
           where 1 = 1 AND (", # to get rid of the "where"
       paste(lapply(sublists, 
                    FUN = function(x){
                      paste0("OR c.ID IN (",  paste(shQuote(x, type = "sh"), collapse = ', '), ")")
                    }), 
             collapse = "\n"), ")
and e.Parameters in
           ('Section1',
           'Section2','Section3',
           'Section4');") %>% cat

res1 <- dbGetQuery(conn,query)

res2<-res1[res1$Parameters=="Section1",4:5]
colnames(res2)[colnames(res2)=="status"] <- "Section1_Status"
Spoilfive answered 13/2, 2020 at 13:23 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.