Joining two datasets using fuzzy logic
Asked Answered
W

1

12

I’m trying to do a fuzzy logic join in R between two datasets:

  • first data set has the name of a location and a column called config
  • second data set has the name of a location and two additional attributes that need to be summarized before they are joined to the first data set.

I would like to use the name column to join between the two data sets. However the name column may have additional or leading characters in either data set or have one word contained inside of a larger word. So for example if we looked at these two data sets, I'd like the name OPAL to join to the OPALAS, and SAUSALITO Y to join to SAUSALITO.

Dataset1:    
     Name           Config
     ALTO D         BB
     CONTRA         ST
     EIGHT A        DD
     OPALAS         BB
     SAUSALITO Y    AA
     SOLANO J       ST

Dataset2:    
    Name       Age     Rank
    ALTO D     50      2
    ALTO D     20      6
    CONTRA     10      10
    CONTRA     15      15
    EIGHTH     18      21
    OPAL       19      4
    SAUSALITO  2       12
    SOLANO     34      43

Code for Summarizing of Dataset2

Data2a <- summaryBy(Age ~ Name,FUN=c(mean), data=Data2,na.rm=TRUE)
Data2b <- summaryBy(Rank ~ Name,FUN=c(sum), data=Data2,na.rm=TRUE)
Data2 <- data.frame(Data2a$Name, Data2a$Age.mean, Data2b$Rank.sum)

Desired Outcome:
    Name        Config  Age   Rank
    ALTO D      BB      35    8
    CONTRA      ST      12.5  25
    EIGHT A     DD      18    21
    OPALAS      BB      19    4
    SAUSALITO Y AA      12    5
    SOLANO J    ST      34    43
Wilburn answered 4/6, 2017 at 23:36 Comment(1)
check out the fuzzyjoin package.Rambort
W
12

I was able to join the two datasets, using the fuzzyjoin package:

library(fuzzyjoin)
stringdist_inner_join(Dataset1, Data2,
     by ="Name", distance_col = NULL)
Wilburn answered 5/6, 2017 at 1:22 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.