I have two large datasets, one around half a million records and the other one around 70K. These datasets have address. I want to match if any of the address in the smaller data set are present in the large one. As you would imagine address can be written in different ways and in different cases / spellings etc. Apart from this address can be duplicated if written only till the building level. So different flats have the same address. I did some research and figured out the package stringdist that can be used.
I did some work and managed to get the closest match based on distance. However I am not able to return the corresponding columns for which the address match.
Below is a sample dummy data along with code that I have created to explain the situation
library(stringdist)
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr")
Year1 <- c(2001:2007)
Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
df1 <- data.table(Address1,Year1)
df2 <- data.table(Address2,Year2)
df2[,unique_id := sprintf("%06d", 1:nrow(df2))]
fn_match = function(str, strVec, n){
strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)]
}
df1[!is.na(Address1)
, address_match :=
fn_match(Address1, df2$Address2,3)
]
This returns me the closed string match based on distance of 3, however I wanted to also have columns of "Year" and "unique_id" from df2 in df1. This would help me to know with which row of data the string was matched from df2. So finally I want to know for each row in df1 what was the closet match from df2 based on the distance specified and have for the matching rows the specific "Year" and "unique_id" from df2.
I guess there is something to do with merge (left join), but I am not sure how I can merge keeping the duplicates and ensuring that I have same number of rows as in df1 (small data set).
Any kind of solution would help!!
?which.min
to wrapstringdist()
from your previous question. Consider also how you want to handle ties. – Headcheese