Automatically extracting strings with mismatched spellings from a column and replacing them in R [closed]

R


I have a large dataset with columns similar to the ones below:

NameofEmployee <- c("x", "y", "z", "a")
Region <- c("Pune", "Orissa", "Orisa", "Poone")

As you can see, in the Region column, the region "Pune" is spelled in two different ways, i.e. "Pune" and "Poone".

Similarly, "Orissa" is spelled as "Orissa" and "Orisa".

I have multiple regions which are actually the same but are spelled in different ways. This will cause problems when I analyze the data.

I want to automatically be able to obtain a list of these mismatched spellings with the help of R.
I would also like to replace the spellings with the correct spellings automatically.

Rapture answered 24/7, 2018 at 6:28 Comment(6)

I'm failing to understand why this question is on hold? – Docent 25/7, 2018 at 6:32

The question was voted as "Off-topic" by a few people, which is why it is on hold. I'm not sure why they think this is off-topic. It is a question about text analysis in R. – Rapture 25/7, 2018 at 6:47

The users who voted to close gave this specific reason: "Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic". – Datum 26/7, 2018 at 17:18

There are two actions you can take to improve this question so it's not off-topic: 1) remove "Is there a package I could use for this?" and 2) show what things you have already tried. For example, why don't you see how far you can get with regexp, prefix matching, and the R builtin function adist? – Datum 26/7, 2018 at 17:18

@Hardikgupta and others: this question is too broad. It's merely a list of requirements. Lists of requirements are not considered good fits for the platform. If, for instance, you had the start of an algorithm that would group words together and you were stuck somewhere, that would totally get reopened (provided you produce the code and the error message). In its current state, there are virtually infinite ways of doing this. – Illuviation 26/7, 2018 at 19:1

@Rapture Merely being about text analysis in R does not make a question on topic. "Hey how do I analyze lorem ipsum in R" is also a question that is about text analysis in R. It's really not a good or on-topic question just for that. You might want to read more in the help center about what "on-topic" means. It's not only "generally touches the subject of programming". – Illuviation 26/7, 2018 at 19:5


Misspellings are hard to detect, even more so when working with names.

I'd suggest using a string distance to measure how close two words are. You can easily do this with tidystringdist, which lets you build all the pairwise combinations from a vector and then compute every string distance method available in stringdist:

Region <- c("Pune", "Orissa", "Orisa", "Poone")

library(tidystringdist)
library(magrittr)

tidy_comb_all(Region) %>%
  tidy_stringdist()
#> # A tibble: 6 x 12
#>   V1     V2      osa    lv    dl hamming   lcs qgram cosine jaccard     jw
#> * <chr>  <chr> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl>  <dbl>
#> 1 Pune   Oris…     6     6     6     Inf    10    10 1          1   1     
#> 2 Pune   Orisa     5     5     5     Inf     9     9 1          1   1     
#> 3 Pune   Poone     2     2     2     Inf     3     3 0.433      0.4 0.217 
#> 4 Orissa Orisa     1     1     1     Inf     1     1 0.0513     0   0.0556
#> 5 Orissa Poone     6     6     6     Inf    11    11 1          1   1     
#> 6 Orisa  Poone     5     5     5       5    10    10 1          1   1     
#> # ... with 1 more variable: soundex <dbl>

Created on 2018-07-24 by the reprex package (v0.2.0).

As you can see here, Pune and Poone have an osa, lv and dl distance of 2, and Orisa / Orissa a distance of 1, suggesting their spelling is very close.

When you have identified these, you can do the replacement.
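
To narrow the table down to likely misspellings, one option is to keep only the pairs below a distance cutoff. This is a minimal sketch that calls stringdist directly; the cutoff `max_dist` and the names `pairs` and `candidates` are assumptions for illustration, and a cutoff of 2 will probably flag unrelated short names in a real dataset, so the result should be reviewed by hand.

```r
library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")

# All unordered pairs of distinct spellings
vals  <- unique(Region)
pairs <- t(combn(vals, 2))

# Levenshtein distance for each pair
d <- stringdist(pairs[, 1], pairs[, 2], method = "lv")

# Hypothetical cutoff: pairs at distance <= max_dist are flagged for review
max_dist <- 2
candidates <- data.frame(V1 = pairs[, 1], V2 = pairs[, 2], lv = d,
                         stringsAsFactors = FALSE)[d <= max_dist, ]
candidates
```

With the example data this keeps the Pune / Poone and Orissa / Orisa pairs and drops the rest.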

Leaf answered 24/7, 2018 at 6:49 Comment(5)

This worked perfectly, thanks. Since there are many mismatched spellings, would there be an easy way to replace them all instead of doing it manually with the plyr package or gsub()? – Rapture 24/7, 2018 at 7:1

This is complicated: for example, if I compare "Orissa" and "Orisa" and find a stringdist of 1, it's hard to decide which one is the correct spelling. I mean, you know which one is the good one, but your computer doesn't, unless you have a dictionary of correctly spelled words? – Leaf 24/7, 2018 at 7:19

Unfortunately, I don't. What if I made a dictionary with the correct spellings? Would there be a way to automate the process then? – Rapture 24/7, 2018 at 7:38

Yes, in that case you could say "every string that is within a distance of X from a dictionary entry has to be replaced by that entry" – Leaf 24/7, 2018 at 13:34

which package would help me to do this? – Rapture 24/7, 2018 at 13:40
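
One possible way to do the dictionary-based replacement discussed in these comments is approximate matching with `amatch()` from stringdist. This is only a sketch: the `dictionary` vector and the cutoff `max_dist` are made up for illustration, and the right cutoff depends on the data.

```r
library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")

# Hypothetical dictionary of correct spellings (an assumption for illustration)
dictionary <- c("Pune", "Orissa")

# Index of the closest dictionary entry within max_dist edits, NA if none
max_dist <- 2
idx <- amatch(Region, dictionary, method = "osa", maxDist = max_dist)

# Replace matched values, keep unmatched values as they are
Region_clean <- ifelse(is.na(idx), Region, dictionary[idx])
Region_clean
```

Values with no dictionary entry within the cutoff are left unchanged, so they can be listed and checked separately with `Region[is.na(idx)]`.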


I believe that you should use a phonetic code to determine which spellings are close to which.

A good choice is the soundex algorithm, implemented in several R packages. I will use package stringdist.

library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")
phonetic(Region)
#[1] "P500" "O620" "O620" "P500"

As you can see, Region[1] and Region[4] have the same soundex code. And the same for Region[2] and Region[3].
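
To get the list of variant spellings from these codes, the distinct values can be grouped by their soundex code. A minimal sketch, where the names `groups` and `variants` are only illustrative:

```r
library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")

# Group the distinct spellings by their soundex code
vals   <- unique(Region)
groups <- split(vals, phonetic(vals))

# Keep only the codes shared by more than one spelling
variants <- groups[lengths(groups) > 1]
variants
```

Each element of `variants` is a set of spellings that sound alike under soundex; deciding which spelling in a group is the correct one still has to be done by hand or with a reference list.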

Metalinguistics answered 24/7, 2018 at 6:45 Comment(5)

this worked but many regions were still mismatched! – Rapture 24/7, 2018 at 7:2

@Rapture This worked with the example data you posted. I cannot tell how it will perform with other data. (And no algorithm is perfect, of course. Soundex was developed for the English language, other languages have different sound/spelling rules.) – Metalinguistics 24/7, 2018 at 7:5

The region names are Indian. This could be why it did not work. – Rapture 24/7, 2018 at 7:15

@Rapture Can you post examples where it failed? – Metalinguistics 24/7, 2018 at 8:38

One example would be "Udisa" and "Odissa" in my data set. Both words are the same region spelled in different ways. The codes for these were U320 and O320, respectively. However, I have to say this package is much more reliable in most cases compared to the tidystringdist package. It just wasn't the right fit for this data! @Rui Barradas – Rapture 26/7, 2018 at 16:0
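
Soundex keeps the first letter of the word as the first character of the code, so two spellings that start with different vowels can never share a code. A small sketch of the case from this comment; comparing the codes without their first character is just a heuristic assumed here for illustration, not a feature of the package, and it will also merge names that are genuinely different.

```r
library(stringdist)

x <- c("Udisa", "Odissa")

# Soundex codes differ only in the leading letter
codes <- phonetic(x)

# Heuristic (assumption): compare the codes without the first character
same_tail <- substring(codes[1], 2) == substring(codes[2], 2)

# The edit distance between the two spellings is still small
d <- stringdist(x[1], x[2], method = "lv")

codes
same_tail
d
```

Combining the two signals (same code tail and a small edit distance) may catch cases like this one that plain soundex misses.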
