How to use custom SQL function in dbplyr?

Asked 2/6, 2018 at 22:57 Answered 4/2, 2022 at 15:34

I would like to calculate the Jaro-Winkler string distance in a database. If I bring the data into R (with collect) I can easily use the stringdist function from the stringdist package.

But my data is very large and I'd like to filter on Jaro-Winkler distances before pulling the data into R.

There is SQL code for Jaro-Winkler (https://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/ and a version for T-SQL) but I guess I'm not sure how best to get that SQL code to work with dbplyr. I'm happy to try and map the stringdist function to the Jaro-Winkler sql code but I don't know where to start on that. But even something simpler like executing the SQL code directly from R on the remote data would be great.

I had hoped that SQL translation in the dbplyr documentation might help, but I don't think so.

Roberts answered 2/6, 2018 at 22:57 Comment(2)

Could you please provide a Minimal, Complete, Verifiable example with the sample data code baked into it? stackoverflow.com/help/mcve – Folliculin 2/6, 2018 at 23:31

Can you just make a SQL call directly and use dplyr after you being your data in? – Carraway 2/6, 2018 at 23:46

You can build your own SQL functions in R. They just have to produce a string that is a valid SQL query. I don't know the Jaro-Winkler distance, but I can provide an example for you to build from:

union_all = function(table_a,table_b, list_of_columns){
  # extract database connection
  connection = table_a$src$con

  sql_query = build_sql(con = connection,
                      sql_render(table_a),
                      "\nUNION ALL\n",
                      sql_render(table_b)
  )

  return(tbl(connection, sql(sql_query)))
}

unioned_table = union_all(table_1, table_2, c("who", "where", "when"))

Two key commands here are:

sql_render, which takes a dbplyr table and returns the SQL code that produces it
build_sql, which assembles a query from strings.

You have choices for your execution command:

tbl(connection, sql(sql_query)) will return the resulting table
dbExecute(db_connection, as.character(sql_query)) will execute a query without returning the result (useful for for dropping tables, creating indexes, etc.)

Nmr answered 18/9, 2018 at 20:40 Comment(2)

I have tried out your solution and it works. But I am bothered with one thing: the first time dbplyr connects to a database via table_a <-tbl(con, "table_a") it needs to do a dummy query, i.e. "select * from table_a where 0 = 1". If I now use R object table_a in the sql_render/build_sql, it will do the same type of WHERE 0 = 1 query again. Is there a way to prevent dbplyr to do that and instead reuse the meta-data from the first tbl(...) call? – Smokestack 17/1, 2019 at 14:39

I have seen dbplyr produce SQL queries containing WHERE 0 = 1 but they have never effected my custom SQL functions. It might be due to how you connect to the SQL tables, or because you call render_sql(tbl(con, "table_a")) instead of first assigning tbl_a = tbl(con, "table_a") before calling render_sql(tbl_a). But I can not tell from the details you have provided. Consider opening a new question and posting a link to it here - I can take a more detailed look. – Nmr 18/1, 2019 at 21:20

You can use sql() which runs whatever raw SQL you provide.

Example

Here the lubridate equivalent doesn't work on a database backend.

So instead I place custom SQL code sql("EXTRACT(WEEK FROM meeting_date)") inside sql(), like so:

your_dbplyr_object %>%
  mutate(week = sql("EXTRACT(WEEK FROM meeting_date)"))

One other thing, this is the top search result when searching for how to render the raw sql using dbplyr. In case anyone arrived here from that search, the answer is to pipe to dbplyr::sql_render(). From the docs:

sql_build() creates a select_query S3 object, that is rendered to a SQL string by sql_render()

Repulse answered 4/2, 2022 at 15:34 Comment(0)

Alternatively, find a way to define the function in SQL as a user-defined function, you can then simply use the name of that function as if it were an R function (in a dbplyr query). When R can't find the function locally, it simply passes it to the SQL back-end and assumes it will be a function that's available in SQL-land.

This is a great way to decouple the logic. Down side is that the dbplyr expression is now dependant on the db-backend; you can't run the same code on a local data set. One way around that is to create a UDF that mimics an existing R function. The dplyr will use the local R and dbplyr will use the SQL UDF.

Bernadinebernadotte answered 12/1, 2022 at 23:13 Comment(0)

Example

Recommended topics

Hot tags