I have a large data set on which I need to do string matching. Using some very useful posts from this site as a reference, I wrote a function to do the string matching for my data set. My sample data and code are below.
SAMPLE DATA
library(data.table)
library(stringdist)
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)
Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')
data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]
CODE
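fn.fuzzymatch <- function(dat1, dat2, string1, string2, meth) {
  # Distance matrix: rows are dat1 strings, columns are dat2 strings
  dist.name <- stringdistmatrix(dat1[[string1]], dat2[[string2]], method = meth)
  # Smallest distance in each row
  min.name <- apply(dist.name, 1, min)
  match.s1.s2 <- NULL
  for (i in seq_len(nrow(dist.name))) {
    # First dat2 row that attains the minimum distance for dat1 row i
    s2.i <- match(min.name[i], dist.name[i, ])
    s1.i <- i
    match.s1.s2 <- rbind(data.frame(s1_row = s1.i, s2_row = s2.i, s1name = dat1[s1.i, ][[string1]], s2name = dat2[s2.i, ][[string2]], dist = min.name[i]), match.s1.s2)
  }
  # Rows were prepended in the loop, so restore dat1 order
  output <- match.s1.s2[order(match.s1.s2$s1_row), ]
  return(output)
}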
match_50 <- fn.fuzzymatch(data1,data2,"Address1","Address2","dl")
This works fine for country-level data, but I also have multiple data files at region level, and each region has multiple areas. The area code is given by the AREACODE variable in data1 and the AREA_CODE variable in data2. I want to update my function so that:
- string matching is done within each area, and the output includes that area code
- the output for each region is consolidated across all area codes in that region.
I tried using split to convert the data files into lists and then rbindlist to combine the results, but I have not succeeded and keep getting different kinds of errors. I am sure there is a way to do this, but I cannot find it. I would appreciate any suggestions.
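To make the goal concrete, here is a minimal sketch of the per-area behaviour I am after, assuming the data.table and stringdist packages. The wrapper name fn.fuzzymatch.by.area and its area1/area2 arguments are hypothetical names I made up for illustration. It repeats the nearest-match logic of the function above inside each area code, skips areas that appear in only one table, and stacks the pieces with rbindlist. It does not handle NA strings.

```r
library(data.table)
library(stringdist)

# Hypothetical wrapper: match string1 against string2 within each area code.
# area1 / area2 are the names of the area-code columns in dat1 / dat2.
fn.fuzzymatch.by.area <- function(dat1, dat2, string1, string2, area1, area2, meth) {
  # Only area codes present in both tables can be matched
  areas <- intersect(unique(dat1[[area1]]), unique(dat2[[area2]]))
  res <- lapply(areas, function(a) {
    s1 <- dat1[[string1]][dat1[[area1]] == a]
    s2 <- dat2[[string2]][dat2[[area2]] == a]
    # Rows are s1 strings, columns are s2 strings
    d <- stringdistmatrix(s1, s2, method = meth)
    # First closest candidate for each s1 string
    best <- apply(d, 1, which.min)
    data.table(area = a, s1name = s1, s2name = s2[best],
               dist = d[cbind(seq_along(s1), best)])
  })
  # One consolidated table covering all area codes
  rbindlist(res)
}
```

With the sample data above it would be called as fn.fuzzymatch.by.area(data1, data2, "Address1", "Address2", "AREACODE", "AREA_CODE", "dl"), and the same call could be repeated for each region file.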