How to know the operations made to calculate the Levenshtein distance between strings?
Asked Answered
E

4

11

With the function stringdist, I can calculate the Levenshtein distance between strings : it counts the number of deletions, insertions and substitutions necessary to turn a string into another. For instance, stringdist("abc abc","abcd abc") = 1 because "d" was inserted in the second string.

Is it possible to know the operations made to obtain the Levenshtein distance between two strings ? Or else to know the characters that are different between the 2 strings (in this example, only "d")? Thanks.

library(stringdist)
stringdist("abc abc","abcde acc") = 3

I would like to know that :

  • "d" was inserted

  • "e" was inserted

  • "b" was substitued into "c"

Or more simply, I would like to have the list ("d","e","c").

Extrude answered 30/6, 2019 at 20:17 Comment(1)
The Levenshtein editing paths are not necessarily unique: ofthen there are multiple equally long, and minimal sequences of edits that lead from one string to another.Original
C
11

This is known as the Needleman–Wunsch algorithm. It calculates both the distance between two strings as well as the so-called traceback, which allows you to reconstruct the alignment.

Since this problem mostly crops up in biology when comparing biological sequences, this algorithm (and related ones) are implemented in the R package {Biostrings}, which is part of Bioconductor.

Since this package implements are more general solution than the simple Levenshtein distance, the usage is unfortunately more complex, and the usage vignette is correspondingly long. But the fundamental usage for your purposes is as follows:

library(Biostrings)

dist_mat = diag(27L)
colnames(dist_mat) = rownames(dist_mat) = c(letters, ' ')

result = pairwiseAlignment(
    "abc abc", "abcde acc",
    substitutionMatrix = dist_mat,
    gapOpening = 1, gapExtension = 1
)

This won’t simply give you the list c('b', 'c', 'c'), though, because that list does not fully represent what actually happened here. Instead, it will return an alignment between the two strings. This can be represented as a sequence with substitutions and gaps:

score(result)
# [1] 3
aligned(result)
as.matrix(aligned(result))
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "a"  "b"  "c"  "-"  "-"  " "  "a"  "b"  "c"
aligned(result)

— For each character in the second string it provides the corresponding character in the original string, replacing inserted characters by -. Basically, this is a “recipe” for transforming the first string into the second string. Note that it will only contain insertions and substitutions, not deletions. To get these, you need to perform the alignment the other way round (i.e. swapping the string arguments).

Commonwealth answered 30/6, 2019 at 22:11 Comment(1)
Unfortunately the code above requires you to specify dist_mat manually such that it contains one row and column for each character that your string might contain. The code shown in this answer thus only allows lower-case letters and spaces, nothing else.Commonwealth
T
10

With adist(), you can retrieve the operations:

drop(attr(adist("abc abc","abcde acc", count = TRUE), "counts"))

ins del sub 
  2   0   1 

From ?adist:

If counts is TRUE, the transformation counts are returned as the "counts" attribute of this matrix, as a 3-dimensional array with dimensions corresponding to the elements of x, the elements of y, and the type of transformation (insertions, deletions and substitutions), respectively.

Threewheeler answered 30/6, 2019 at 20:29 Comment(2)
Thanks it helps me a lot! Do you know if there is a function to directly know the characters corresponding to these operations ? Else, I could try to create a function using attr(adist("abda cc","abc abc", count = TRUE),"trafos") #= "MMSDMSIM" where M=match, S=substitute, D=delete, I=insertExtrude
Don't know about any handy function that will do it. However, I assume that playing around trafos will lead you to the desired results.Threewheeler
O
0

Here is the code which extracts the number of changes of each type, and then the corresponding characters for each type of operation:

source_string="12234"
target_string="02345"
lev=adist(source_string,target_string,count=T)

#number of operations of each kind
attributes(lev)$counts[,,"ins"] 
attributes(lev)$counts[,,"del"]
attributes(lev)$counts[,,"sub"]

substitution_bank=deletion_bank=insertion_bank=match_bank=NULL

changes<-strsplit(attributes(lev)$trafos, "")[[1]]

counter_source=counter_target=1
for(j in changes){
 if(j=="S") {
   substitution_bank=rbind(substitution_bank,
           cbind(strsplit(source_string,"")[[1]][counter_source], strsplit(target_string,"")[[1]][counter_target]))
   counter_source=counter_source+1
   counter_target=counter_target+1
 }
 if(j=="I") {
   insertion_bank=rbind(insertion_bank,
                           strsplit(target_string,"")[[1]][counter_target])
   counter_target=counter_target+1
 }
 if(j=="D") {
   deletion_bank=rbind(deletion_bank,
                        strsplit(source_string,"")[[1]][counter_source])
   counter_source=counter_source+1
 }
 if(j=="M") {
   match_bank=rbind(match_bank,
                           strsplit(source_string,"")[[1]][counter_source])
   counter_source=counter_source+1
   counter_target=counter_target+1
 }
 

}

substitution_bank
deletion_bank
insertion_bank
match_bank

Honestly, I'm ashamed of the code -- it seems wasteful to go one character at a time. But in the presence of both insertions and deletions, I couldn't figure out how to extract the right characters... So more elegant answers are welcome!

Octastyle answered 3/9, 2021 at 6:14 Comment(0)
T
0

Investigating the comment by @tmfmnk to look at trafos led me to the following approach.

Function to return parts of string where edit_string is equal to match:

character_match = function(string, edit_string, match, drop = NA){
  # convert to array
  string = strsplit(string, "")[[1]]
  edit_string = strsplit(edit_string, "")[[1]]
  
  if(!is.na(drop)){
    edit_string = edit_string[edit_string != drop]
  }
  
  if(length(string) != length(edit_string)){
    stop("string and edit_string are different lengths")
  }
  
  output = rep("_", length(edit_string))
  is_match = edit_string == match
  output[is_match] = string[is_match]
  
  output = paste0(output, collapse = "")
  return(output)
}

Applying it to this problem:

s1 = "123456789"
s2 = "0123zz67"

out = adist(s1, s2, counts = TRUE)
edit_string = drop(attr(out, "trafos"))

Now the edit string will include the letter codes:

  • I = insert
  • M = match
  • S = substitute
  • D = delete

We can extract these with our function as follows:

# characters in s1 that match s2
character_match(s1, edit_string, "M", "I")
# "123__67__"

# characters in s1 that were substituted out
character_match(s1, edit_string, "S", "I")
# "___45____"

# characters in s1 that were deleted
character_match(s1, edit_string, "D", "I")
# "_______89"

# characters in s2 that match s1
character_match(s2, edit_string, "M", "D")
# "_123__67"

# characters in s2 that were substituted in
character_match(s2, edit_string, "S", "D")
# "____zz__"

# characters in s2 that were inserted
character_match(s2, edit_string, "I", "D")
# "0_______"

From here it is easy to see which characters and position were inserted, deleted, or substituted.

Televise answered 12/7, 2023 at 1:44 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.