Fuzzy Match Across Columns in R

How can I measure how similar two names are in R? In other words, how can I score the strength of a fuzzy match between them?

For example, I am working with a data frame that looks like this:

Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")

df1 <- data.frame(Name.1, Name.2)
df1
            Name.1             Name.2
1         gonzalez gonzalezsoldevilla
2 wassermanschultz            schultz
3   athanasopoulos    anthanasopoulos
4           armato             strain

It is clear from the data that rows 1 and 2 are similar enough to be confident that the name is the same. Row 3 is the same name even though it is misspelled, and row 4 is completely different.

As an output, I would like to create a third column that describes the degree of similarity between the names or returns a boolean of some kind to indicate a fuzzy match can be made.

Lapoint answered 12/7, 2020 at 8:0 Comment(0)

The stringdist package has a function, stringsim, that returns a number between 0 and 1 for the similarity between two strings, where 1 means the strings are identical.

library(stringdist)

Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")

df1 <- data.frame(Name.1, Name.2)
# stringsim() compares the two vectors element by element
df1$similar <- stringsim(Name.1, Name.2)
df1
#>             Name.1             Name.2   similar
#> 1         gonzalez gonzalezsoldevilla 0.4444444
#> 2 wassermanschultz            schultz 0.4375000
#> 3   athanasopoulos    anthanasopoulos 0.9333333
#> 4           armato             strain 0.1666667
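
To make the score less of a black box, here is a minimal base-R sketch of how such a similarity can be derived, plus the boolean column the question asks for. It assumes the score is one minus the edit distance divided by the length of the longer string, which is my understanding of how stringsim normalises edit-based distances. It uses adist() from base R (plain Levenshtein distance) instead of stringdist. The 0.8 cutoff and the column name match are hypothetical choices for illustration, not part of the answer above.

```r
# Same example data as in the question.
Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")
df1 <- data.frame(Name.1, Name.2, stringsAsFactors = FALSE)

# adist() returns the full distance matrix between the two vectors;
# the diagonal holds the row-wise pairs (Name.1[i] vs Name.2[i]).
edit_dist <- diag(adist(df1$Name.1, df1$Name.2))

# Normalise by the longer string so the score lies in [0, 1].
max_len <- pmax(nchar(df1$Name.1), nchar(df1$Name.2))
df1$similar <- 1 - edit_dist / max_len

# Hypothetical cutoff (assumption): tune it on your own data.
threshold <- 0.8
df1$match <- df1$similar >= threshold
df1
```

For these four rows the sketch gives the same scores as the stringsim output above, and only row 3 passes the 0.8 cutoff. Rows 1 and 2 score low because one name is merely contained in the other, so an edit-distance score alone will not flag them. stringsim also takes a method argument (for example method = "jw" for Jaro-Winkler), which may be worth comparing on your data.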
Damper answered 12/7, 2020 at 8:22 Comment(3)
This is fabulous! Thank you so much for this package! I appreciate the help. - Lapoint
@Sharif Amlani you are welcome. You should thank the author of the package. - Damper
Excellent, I'll shoot him/her an email! - Lapoint
