I would like to calculate the Jaro-Winkler string distance in a database. If I bring the data into R (with collect
) I can easily use the stringdist
function from the stringdist
package.
But my data is very large and I'd like to filter on Jaro-Winkler distances before pulling the data into R.
There is SQL code for Jaro-Winkler (https://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/ and a version for T-SQL) but I guess I'm not sure how best to get that SQL code to work with dbplyr
. I'm happy to try and map the stringdist
function to the Jaro-Winkler sql
code but I don't know where to start on that. But even something simpler like executing the SQL code directly from R on the remote data would be great.
I had hoped that SQL translation in the dbplyr
documentation might help, but I don't think so.