Investigating the comment by @tmfmnk to look at trafos
led me to the following approach.
Function to return the parts of string where edit_string is equal to match:
character_match = function(string, edit_string, match, drop = NA){
  # split both inputs into vectors of single characters
  string = strsplit(string, "")[[1]]
  edit_string = strsplit(edit_string, "")[[1]]

  # remove the edit code that has no counterpart in string
  if(!is.na(drop)){
    edit_string = edit_string[edit_string != drop]
  }

  if(length(string) != length(edit_string)){
    stop("string and edit_string are different lengths")
  }

  # keep characters whose edit code equals match, mask the rest with "_"
  output = rep("_", length(edit_string))
  is_match = edit_string == match
  output[is_match] = string[is_match]

  output = paste0(output, collapse = "")
  return(output)
}
Applying it to this problem:
s1 = "123456789"
s2 = "0123zz67"
out = adist(s1, s2, counts = TRUE)
edit_string = drop(attr(out, "trafos"))
Now the edit string will include the letter codes:
- I = insert
- M = match
- S = substitute
- D = delete
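The edit string describes the alignment of both strings at once, so it is generally longer than either of them. An insert code has no counterpart in s1 and a delete code has no counterpart in s2, which is why one of them has to be dropped before the codes line up with a string character by character. A minimal sketch of that check, using only base R (codes_s1 and codes_s2 are names made up for this illustration):

```r
s1 = "123456789"
s2 = "0123zz67"
edit_string = drop(attr(adist(s1, s2, counts = TRUE), "trafos"))

# removing the insert codes leaves one code per character of s1
codes_s1 = gsub("I", "", edit_string, fixed = TRUE)
# removing the delete codes leaves one code per character of s2
codes_s2 = gsub("D", "", edit_string, fixed = TRUE)

nchar(codes_s1) == nchar(s1)
# TRUE
nchar(codes_s2) == nchar(s2)
# TRUE
```

This is the same length condition that character_match enforces before it masks the string.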
We can extract these with our function as follows:
# characters in s1 that match s2
character_match(s1, edit_string, "M", "I")
# "123__67__"
# characters in s1 that were substituted out
character_match(s1, edit_string, "S", "I")
# "___45____"
# characters in s1 that were deleted
character_match(s1, edit_string, "D", "I")
# "_______89"
# characters in s2 that match s1
character_match(s2, edit_string, "M", "D")
# "_123__67"
# characters in s2 that were substituted in
character_match(s2, edit_string, "S", "D")
# "____zz__"
# characters in s2 that were inserted
character_match(s2, edit_string, "I", "D")
# "0_______"
From here it is easy to see which characters were inserted, deleted, or substituted, and at which positions.
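If numeric positions are more convenient than the masked strings, the same idea can return indices instead. The sketch below is an assumption about how one might do that, not part of the approach above: match_positions is a hypothetical helper, and the edit string is written out by hand as the one implied by the outputs shown above rather than recomputed with adist.

```r
# hypothetical helper: 1-based positions whose edit code equals match,
# after removing the code that has no counterpart in the string of interest
match_positions = function(edit_string, match, drop = NA){
  codes = strsplit(edit_string, "")[[1]]
  if(!is.na(drop)){
    codes = codes[codes != drop]
  }
  which(codes == match)
}

# edit string implied by the outputs above (assumed, not recomputed here)
edit_string = "IMMMSSMMDD"

# positions in s1 that were substituted out
match_positions(edit_string, "S", "I")
# 4 5
# positions in s1 that were deleted
match_positions(edit_string, "D", "I")
# 8 9
# positions in s2 that were inserted
match_positions(edit_string, "I", "D")
# 1
```

The positions are relative to s1 when "I" is dropped and relative to s2 when "D" is dropped, mirroring the calls to character_match.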