Finding closest neighbour using optimized Levenshtein Algorithm

I recently posted a question about optimizing the algorithm to compute the Levenshtein Distance, and the replies led me to the Wikipedia article on Levenshtein Distance.

The article mentioned that if there is a bound k on the maximum distance a possible result can be from the given query, then the running time can be reduced from O(mn) to O(kn), m and n being the lengths of the strings. I looked up the algorithm, but I couldn't really figure out how to implement it. I was hoping to get some leads on that here.

The optimization is #4 under "Possible Improvements".

The part that confused me is the part that says we only need to compute a diagonal stripe of width 2k+1 in the matrix, centered on the main diagonal (the main diagonal being the cells with coordinates (i,i)).

If someone could offer some help/insight, I would really appreciate it. If needed, I can post the complete description of the algorithm in the book as an answer here.
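For context, here is my rough, untested attempt at what I think the stripe computation might look like, so you can see where my understanding may be going wrong (the two-row layout and the function name are my own guesses, and cells outside the stripe are treated as k+1):

/* Untested sketch: banded Levenshtein distance that only fills the cells
   with |i - j| <= k, using two rows of the matrix. Cells outside the
   stripe are treated as "infinity" (represented here by k+1).
   Returns the exact distance if it is <= k, otherwise k+1. */
#include <stdlib.h>
#include <string.h>

static int min3(int a, int b, int c){
  int m = a < b ? a : b;
  return m < c ? m : c;
}

int banded_levenshtein(const char* s, const char* t, int k){
  int m = (int)strlen(s), n = (int)strlen(t);
  if (abs(m - n) > k) return k + 1;      /* length difference alone exceeds the bound */

  int big = k + 1;                       /* stands in for "infinity" outside the stripe */
  int* prev = malloc((n + 1) * sizeof *prev);
  int* curr = malloc((n + 1) * sizeof *curr);

  for (int j = 0; j <= n; j++)           /* row 0: d[0][j] = j inside the stripe */
    prev[j] = (j <= k) ? j : big;

  for (int i = 1; i <= m; i++){
    int lo = (i - k > 1) ? i - k : 1;    /* leftmost column of the stripe in row i  */
    int hi = (i + k < n) ? i + k : n;    /* rightmost column of the stripe in row i */

    curr[lo - 1] = (i <= k) ? i : big;   /* cell just left of the stripe */
    for (int j = lo; j <= hi; j++){
      int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
      curr[j] = min3(prev[j] + 1,         /* deletion     */
                     curr[j - 1] + 1,     /* insertion    */
                     prev[j - 1] + cost); /* substitution */
    }
    if (hi < n) curr[hi + 1] = big;      /* cell just right of the stripe */

    int* tmp = prev; prev = curr; curr = tmp;
  }

  int d = (prev[n] > k) ? k + 1 : prev[n];
  free(prev); free(curr);
  return d;
}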

Reichert answered 7/7, 2010 at 13:41 Comment(2)
Where did you get "centered on the main diagonal ... defined as coordinates (i, i)" from? The quoted article doesn't say that. The stripe needs to be centred on the TRAILING diagonal (which ends in (m, n)) ... the whole idea is to calculate d[m][n] ... d[i][j] is the Lev. distance between the first i chars in s and the first j chars in t.Mutation
@John: I got that from the book that the Wikipedia article cites. It says specifically that the last element need not be d[m][n]; it will be d[i][i] instead. While this will not be the correct Levenshtein distance of the two words, it will tell us whether the two words are closer than our bound k or not, which is why it's still useful.Reichert

I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes that I use to prune the tree. With that routine in hand, I first run it with k=0, then k=1, then k=2, until I either get a hit or I don't want to go any higher.

#include <string.h>   /* strlen */
#include <stdbool.h>  /* bool   */

char* a;   /* string 1 */
char* b;   /* string 2 */
int na;    /* strlen(a) -- set a, b, na, nb before calling walk() */
int nb;    /* strlen(b) */
bool walk(int ia, int ib, int k){
  /* if the budget is exhausted, prune the search */
  if (k < 0) return false;
  /* if at end of both strings we have a match */
  if (ia == na && ib == nb) return true;
  /* if the first characters match, continue walking with no reduction in budget */
  if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
  /* if the first characters don't match, assume there is a 1-character replacement */
  if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
  /* try assuming there is an extra character in a */
  if (ia < na && walk(ia+1, ib, k-1)) return true;
  /* try assuming there is an extra character in b */
  if (ib < nb && walk(ia, ib+1, k-1)) return true;
  /* if none of those worked, I give up */
  return false;
}
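A minimal driver for the routine above might look like this (the wrapper name find_distance and the max_k cap are just for illustration, not part of the original code):

/* Iterative-deepening driver: try k = 0, 1, 2, ... as described above. */
int find_distance(char* s1, char* s2, int max_k){
  a = s1;  b = s2;                   /* set the globals used by walk() */
  na = (int)strlen(a);
  nb = (int)strlen(b);
  for (int k = 0; k <= max_k; k++)
    if (walk(0, 0, k)) return k;     /* smallest budget that admits a match */
  return -1;                         /* the strings are more than max_k edits apart */
}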

Added to explain trie-search:

#include <stdio.h>   /* printf */

// definition of trie-node:
typedef struct TNode {
  struct TNode* pa[128]; // for each possible character, pointer to subnode (NULL if absent)
} TNode;

// simple trie-walk of a node
// key is the input word, answer is the output word,
// i is the character position, and hdis is the hamming distance.
void walk(TNode* p, char key[], char answer[], int i, int hdis){
  // If this is the end of a word in the trie, it is marked as
  // having something non-null under the '\0' entry of the trie.
  if (p->pa[0] != NULL){
    if (key[i] == '\0') printf("answer = %s, hdis = %d\n", answer, hdis);
  }
  // for every actual subnode of the trie
  // (use int for the loop counter: a plain char may be signed, so `c < 128` could loop forever)
  for (int c = 1; c < 128; c++){
    // if it is a real subnode
    if (p->pa[c] != NULL){
      // keep track of the answer word represented by the trie
      answer[i] = (char)c; answer[i+1] = '\0';
      // and walk that subnode
      // If the answer disagrees with the key, increment the hamming distance
      walk(p->pa[c], key, answer, i+1, (answer[i]==key[i] ? hdis : hdis+1));
    }
  }
}
// Note: you have to edit this to handle short keys.
// Simplest is to just append a lot of '\0' bytes to the key.

Now, to limit it to a budget, just refuse to descend if hdis is too large.
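A sketch of that pruning, with the budget passed in as an extra parameter k (the parameter is an illustrative addition to the walk above):

// Same trie-walk as above, but with a budget k: any branch whose running
// hamming distance already exceeds the budget is pruned.
void walk_bounded(TNode* p, char key[], char answer[], int i, int hdis, int k){
  if (hdis > k) return;               // budget exhausted: prune this branch
  if (p->pa[0] != NULL && key[i] == '\0')
    printf("answer = %s, hdis = %d\n", answer, hdis);
  for (int c = 1; c < 128; c++){
    if (p->pa[c] != NULL){
      answer[i] = (char)c; answer[i+1] = '\0';
      walk_bounded(p->pa[c], key, answer, i+1,
                   (answer[i]==key[i] ? hdis : hdis+1), k);
    }
  }
}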

Crellen answered 7/7, 2010 at 15:47 Comment(7)
without DP or memoization, the running time of this algorithm doesn't appear to be O(kn)Georgianngeorgianna
In the average case, this should still be faster than O(mn) because of the bounds on k. However, in the worst case this will be slower than O(mn).Reichert
@Dharmesh: I confess I've used it more in the context of spelling-correction, where it is conjoined with a trie-walk of a dictionary to find nearest-matching words.Crellen
@Mike: I also have a trie walkthrough to find if an exact match exists. I'm using this in conjunction with another algorithm to find the nearest match if an exact match does not exist. After profiling my code, I found that the vast majority of the time was being spent calculating the Levenshtein distance, which is why I'm trying to optimize that bit right now.Reichert
@Dharmesh: Then my approach might help you, because I combine the distance calculation into the trie-walk. It's not a separate computation. I've posted this code elsewhere on SO: #2919271 (It's been about 30 years, so it may be a bit rusty.)Crellen
@Mike: I read your answer for the other question. The thing is, if the word starts with a letter that's incorrect then a trie is useless. I haven't had time to go through your code yet, so you might have accounted for this somehow. Thanks anyway!Reichert
@Dharmesh: No. Here's a simple way to think about it. You just walk the entire trie. Each time you descend to a subnode, you're committing to a letter in the dictionary. As you do that, keep track of the cost. If the letter is the same as the corresponding letter in the key, then the cost doesn't increase, otherwise it goes up by 1. So, when you get to each leaf, the cost = hamming distance. Then use the budget to prune the search, so you don't reach any leaves with hamming distance greater than the budget. Then generalize to more interesting changes.Crellen
