R fuzzy string match to return specific column based on matched string

Asked 12/3, 2017 at 15:45 Answered 6/6, 2017 at 5:47

Solved r merge data.table string-matching stringdist

I have two large datasets, one around half a million records and the other one around 70K. These datasets have address. I want to match if any of the address in the smaller data set are present in the large one. As you would imagine address can be written in different ways and in different cases / spellings etc. Apart from this address can be duplicated if written only till the building level. So different flats have the same address. I did some research and figured out the package stringdist that can be used.

I did some work and managed to get the closest match based on distance. However I am not able to return the corresponding columns for which the address match.

Below is a sample dummy data along with code that I have created to explain the situation

library(stringdist)
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr")
Year1 <- c(2001:2007)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)

df1 <- data.table(Address1,Year1)
df2 <- data.table(Address2,Year2)
df2[,unique_id := sprintf("%06d", 1:nrow(df2))]

fn_match = function(str, strVec, n){
  strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)]
}

df1[!is.na(Address1)
    , address_match := 
      fn_match(Address1, df2$Address2,3)
    ]

This returns me the closed string match based on distance of 3, however I wanted to also have columns of "Year" and "unique_id" from df2 in df1. This would help me to know with which row of data the string was matched from df2. So finally I want to know for each row in df1 what was the closet match from df2 based on the distance specified and have for the matching rows the specific "Year" and "unique_id" from df2.

I guess there is something to do with merge (left join), but I am not sure how I can merge keeping the duplicates and ensuring that I have same number of rows as in df1 (small data set).

Any kind of solution would help!!

Randell answered 12/3, 2017 at 15:45 Comment(3)

Not at my computer right now, but see ?which.min to wrap stringdist() from your previous question. Consider also how you want to handle ties. – Headcheese 12/3, 2017 at 17:19

@C8H10N4O2, Thank you for your suggestion. Yes, which.min helps to know the minimum, but in this case I want to have few corresponding columns from the matched string. Since there are duplicate addresses in the large dataset, I want to have the unique_id to be able to distinguish the rows that are matched and then I can have other needed columns merged from the large dataset basis the unique_id. – Randell 12/3, 2017 at 17:31

@C8H10N4O2, I am really hoping you would be able to suggest some solution on this. Even if we are able to return the row number of the matched string from the large dataset, it should help me then merge the required columns based on the row number. – Randell 12/3, 2017 at 17:59

You are 90% of the way there...

You say you want to

know with which row of data the string was matched from df2

You just need to understand the code you already have. See ?amatch:

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.

In other words, amatch gives you the index for the row in df2 (which is your table) that is the closest match of each address in df1 (which is your x). You are prematurely wrapping this index by returning the new address instead.

Instead, retrieve either the index itself for lookup or the unique_id (if you are confident that it is truly a unique id) for a left join.

Illustration of both approaches:

library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return position from strVec of closest match to str
match_pos = function(str, strVec, n){
  amatch(str, strVec, method = "dl", maxDist=n,useBytes = T) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) | nchar(Address1>0), # I would exclude only on NA_character_ but also empty string, perhaps string of length < 3
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2,3)] ]
merge(df1, df2, by='unique_id', all.x=TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) | nchar(Address1>0),
    df2_pos := match_pos(Address1, df2$Address2,3) ] 
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')):=df2[df2_pos,.(Address2,Year2,unique_id)] ][]

Headcheese answered 14/3, 2017 at 14:12 Comment(3)

Thank you very much for this solution and explanation. It really helps!! Thank you once again. – Randell 15/3, 2017 at 3:44

@Randell you're welcome, also, if you need to check for uniqueness, see ?duplicated, as in,!anyDuplicated(...) – Headcheese 15/3, 2017 at 11:12

Thank you for this it was helpful!! I was also exploring stringdistmatrix to create matrix and then take minimum distance. I had achieved and the code is working. Had created a function for this. However now I need to do matching based on area for various regions. So want to have another function over the existing function. I managed to create a function but function over function finding it difficult.....still lot to learn....I have posted this question. #42794333 Please help!! – Randell 17/3, 2017 at 15:16

Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching.

You can use stringdist method="dl" (or others that might work better).

To meet your requirement of "ensuring that I have same number of rows as in df1", I used a large max_dist and then used dplyr::group_by and dplyr::top_n to get only the best match with minimum distance. This was suggested by dgrtwo, the developer of fuzzyjoin. (Hopefully it'll be part of the package itself in the future.)

(I also had to make an assumption to take the max year2 in the event of distance ties.)

Code:

library(data.table, quietly = TRUE)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) 
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)]

library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)
stringdist_join(df1, df2, 
                by = c("Address1" = "Address2"), 
                mode = "left", 
                method = "dl", 
                max_dist = 99, 
                distance_col = "dist") %>%
  group_by(Address1, Year1) %>%
  top_n(1, -dist) %>%
  top_n(1, Year2)

Result:

# A tibble: 7 x 6
# Groups:   Address1, Year1 [7]
                                Address1 Year1                             Address2 Year2 unique_id  dist
                                   <chr> <int>                                <chr> <int>     <chr> <dbl>
1                    786, GALI NO 5, XYZ  2001                   786, GALI NO 4 XYZ  2007    000007     2
2       rambo, 45, strret 4, atlast, pqr  2002 del, 546, strret2, towards east, pqr  2009    000009    17
3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR  2003                  23/4, STREET 2, PQR  2010    000010    19
4                    45-B, GALI NO5, XYZ  2004                  45B, GALI NO 5, XYZ  2008    000008     2
5                 HECTIC, 99 STREET, PQR  2005                  23/4, STREET 2, PQR  2010    000010    11
6                    786, GALI NO 5, XYZ  2006                   786, GALI NO 4 XYZ  2007    000007     2
7       rambo, 45, strret 4, atlast, pqr  2007 del, 546, strret2, towards east, pqr  2009    000009    17

Anatase answered 6/6, 2017 at 5:47 Comment(0)

Recommended topics

Hot tags