R: repeating a string-matching function based on column values

I have a large data set for which I need to do string matching. I have found some very useful posts on this site, and drawing on them I have written a function that does the string matching for my dataset. My sample data and code are below.

SAMPLE DATA

library(data.table)

Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')

data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]

CODE

library(stringdist)

fn.fuzzymatch <- function(dat1, dat2, string1, string2, meth) {

  # Pairwise string-distance matrix: rows = dat1 strings, columns = dat2 strings
  dist.name <- stringdistmatrix(dat1[[string1]], dat2[[string2]], method = meth)

  # Minimum distance for each dat1 string
  min.name <- apply(dist.name, 1, min)

  match.s1.s2 <- NULL
  for (i in 1:nrow(dist.name)) {
    # Position of the first dat2 string achieving the minimum distance
    s2.i <- match(min.name[i], dist.name[i, ])
    s1.i <- i
    match.s1.s2 <- rbind(data.frame(s1_row = s1.i, s2_row = s2.i,
                                    s1name = dat1[s1.i, ][[string1]],
                                    s2name = dat2[s2.i, ][[string2]],
                                    dist = min.name[i]),
                         match.s1.s2)
  }
  # Rows were accumulated in reverse; restore the original dat1 order
  output <- match.s1.s2[order(match.s1.s2$s1_row), ]
  return(output)
}


match_50 <- fn.fuzzymatch(data1, data2, "Address1", "Address2", "dl")

This works fine for the data at the country level, but I also have multiple data files at the region level, and each region has multiple areas. The area code is given by the AREACODE variable in data1 and the AREA_CODE variable in data2. I want to update my function so that

  1. string matching is done separately for each area, and the output carries that area code;
  2. the output is returned for each region, consolidated across all the area codes in that region.

I tried using split to convert the data files into lists and then rbindlist to combine the results, but I could not get it to work and kept running into different errors. I am sure there is a way to do this, but I have not been able to find it. I would appreciate any suggestions.
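For reference, here is the kind of wrapper I have in mind (a rough sketch of the split/rbindlist idea; fn.fuzzymatch.by and the by_var1/by_var2 arguments are just illustrative names, chosen so the grouping column could later be swapped, e.g. for Year):

library(data.table)
library(stringdist)

# Sketch: run fn.fuzzymatch within each level of a grouping column
# (the area code here) and stack the per-group results with rbindlist
fn.fuzzymatch.by <- function(dat1, dat2, string1, string2, meth,
                             by_var1, by_var2) {
  # Only levels present in both data sets can be matched
  common <- intersect(unique(dat1[[by_var1]]), unique(dat2[[by_var2]]))
  out_list <- lapply(common, function(grp) {
    res <- fn.fuzzymatch(dat1[dat1[[by_var1]] == grp, ],
                         dat2[dat2[[by_var2]] == grp, ],
                         string1, string2, meth)
    # Keep the group value; note that s1_row/s2_row are row numbers
    # within the per-group subsets, not within the full data sets
    res$group <- grp
    res
  })
  rbindlist(out_list)
}

match_by_area <- fn.fuzzymatch.by(data1, data2, "Address1", "Address2",
                                  "dl", "AREACODE", "AREA_CODE")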

Boynton answered 14/3, 2017 at 18:30 Comment(5)
Could you share the code you have tried so far? Do Data1 and Data2 represent how your data looks at the region level?Cussed
@JuanBosco, yes, data1 and data2 represent how the data looks at the regional level. I have multiple data files like this for 10 regions, and I was calling the current function 10 times. But now I need to do the check for each area and get the output for the region. Each region can have 10-40 different areas, and in total I have 190 unique areas.Boynton
I see your function takes pairs of data sets and performs operations with them. What do you mean by checking for each area? Each area in the first data set, in the second data set, all combinations of areas? In short, what would be your expected output? I think you can achieve some possible outputs using your current function, but I'm not completely sure of what you need.Cussed
@JuanBosco, I mean between the areas present in both data sets. For example, area 10 has 2 addresses in data1 and 3 addresses in data2, so the matching for area 10 should happen only between those records: the 2 addresses of area 10 from data1 have 3 choices to match with in area 10 of data2. I want this area variable to be a parameter, because today we want to match by area and tomorrow we may want to match on some other variable, say Year. Hence I am thinking of another function wrapped around the current one. Hope you understand and can help! Thank you!Boynton
Did this answer help?Lopes

While you could probably use an apply-type function to repeat this over the separate data files for the different regions, here is a fuzzyjoin solution based on my answer to your previous question.

It looks for the best stringdist match on Address, and the area code must match exactly (==). I also required Year2 to be >= Year1, just for demonstration.

Finally, I used dplyr::group_by and dplyr::top_n to keep the minimum-distance matches, and I had to decide what to do about ties (I picked the matches with the largest Year2). You can also use slice_min, which replaces the older top_n; and if the original order matters rather than alphabetical order, use mutate(rank = row_number(dist)) %>% filter(rank == 1). A slice_min version of the pipeline tail is sketched after the solution code below.

Data:

library(data.table)

Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')

data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]

Solution:

library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)

# First, we need to define match_fun_stringdist.
# Code adapted from stringdist_join in https://github.com/dgrtwo/fuzzyjoin
match_fun_stringdist <- function(v1, v2) {
  
  # Can't pass these parameters in from fuzzy_join because of multiple incompatible match_funs, so I set them here.
  ignore_case = FALSE
  method = "dl"
  max_dist = 99
  distance_col = "dist"
  
  if (ignore_case) {
    v1 <- stringr::str_to_lower(v1)
    v2 <- stringr::str_to_lower(v2)
  }
  
  # shortcut for Levenshtein-like methods: if the difference in
  # string length is greater than the maximum string distance, the
  # edit distance must be at least that large
  
  # length is much faster to compute than string distance
  if (method %in% c("osa", "lv", "dl")) {
    length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
    include <- length_diff <= max_dist
    
    dists <- rep(NA_real_, length(v1))  # numeric NA, filled in below for the included pairs
    
    dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method)
  } else {
    # have to compute them all
    dists <- stringdist::stringdist(v1, v2, method = method)
  }
  ret <- dplyr::tibble(include = (dists <= max_dist))  # data_frame() is deprecated in favour of tibble()
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- dists
  }
  ret
}
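
# Quick sanity check (added for illustration): calling match_fun_stringdist
# directly on two of the sample addresses returns a one-row tibble with a
# logical `include` column and the numeric `dist` column
match_fun_stringdist("45-B, GALI NO5, XYZ", "45B, GALI NO 5, XYZ")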

# Finally, call fuzzy_join
fuzzy_join(data1, data2, 
           by = list(x = c("Address1", "AREACODE", "Year1"), y = c("Address2", "AREA_CODE", "Year2")), 
           match_fun = list(match_fun_stringdist, `==`, `<=`),
           mode = "left"
           ) %>%
  group_by(Address1, Year1, AREACODE) %>%
  top_n(1, -Address1.dist) %>%
  top_n(1, Year2) %>%
  select(unique_id, Address1.dist, everything())
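
For completeness, here is the same tie-breaking written with the newer slice_min/slice_max verbs mentioned above (a sketch, assuming dplyr >= 1.0, where these verbs were introduced):

fuzzy_join(data1, data2, 
           by = list(x = c("Address1", "AREACODE", "Year1"), y = c("Address2", "AREA_CODE", "Year2")), 
           match_fun = list(match_fun_stringdist, `==`, `<=`),
           mode = "left"
           ) %>%
  group_by(Address1, Year1, AREACODE) %>%
  slice_min(Address1.dist, n = 1) %>%  # minimum distance, replaces top_n(1, -Address1.dist)
  slice_max(Year2, n = 1) %>%          # break ties on the largest Year2, replaces top_n(1, Year2)
  select(unique_id, Address1.dist, everything())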
Lopes answered 6/6, 2017 at 6:47 Comment(0)
