How do I use a Trie for spell checking
Asked Answered
D

3

14

I have a trie that I've built from a dictionary of words. I want to use this for spell checking( and suggest closest matches in the dictionary , maybe for a given number of edits x). I'm thinking I'd use levenshtein distance between the target word and words in my dictionary, but is there a smart way to traverse the trie without actually running the edit distance logic over each word separately? How should I do the traversal and the edit distance matching?

For e.g, if I have words MAN, MANE, I should be able to reuse the edit distance computation on MAN in MANE. Otherwise the Trie wouldnt serve any purpose

Doubt answered 26/1, 2014 at 17:18 Comment(4)
"man/mane" is almost trivial, try "mane/bane"Inverter
I don't think these approaches fit together. You need to apply edit distance to every word in the dictionary to be able to make a suggestion IMHO.Feuar
true, but how can I overlap edit distance computations to avoid recomputation of the same distanceDoubt
OK here's a thought: Run edit distance against every word and prune it on words that exceeds a determinate number of edits (this will make no need to search the whole words in most cases). How to improve it? Since edit distance is calculated backwards, a suffix tree may perform better. When you exceed some edit threshold, you can discard the whole branch. (this is just a crazy idea :)Feuar
E
2

Try computing for each tree node an array A where A[x] the smallest edit distance to be at that position in the trie after matching the first x letters of the target word.

You can then stop examining any nodes if every element in the array is greater than your target distance.

For example, with a trie containing MAN and MANE and an input BANE:

Node 0 representing '', A=[0,1,2,3,4]
Node 1 representing 'M', A=[1,1,2,3,4]
Node 2 representing 'MA', A=[2,1,1,2,3]
Node 3 representing 'MAN' A=[3,2,2,1,2]
Node 4 representing 'MANE' A=[4,3,2,2,1]

The smallest value for A[end] is 1 reached with the word 'MANE' so this is the best match.

Engleman answered 26/1, 2014 at 19:7 Comment(0)
D
6

I think you should instead give a try to bk-trees; it's a data structure that fits well spell-checking as it will allow you to compute efficiently the edit distance with the words of your dictionary.

This link gives a good insight into BK-trees applied to spell-checking

Delphinedelphinia answered 26/1, 2014 at 17:33 Comment(0)
E
2

Try computing for each tree node an array A where A[x] the smallest edit distance to be at that position in the trie after matching the first x letters of the target word.

You can then stop examining any nodes if every element in the array is greater than your target distance.

For example, with a trie containing MAN and MANE and an input BANE:

Node 0 representing '', A=[0,1,2,3,4]
Node 1 representing 'M', A=[1,1,2,3,4]
Node 2 representing 'MA', A=[2,1,1,2,3]
Node 3 representing 'MAN' A=[3,2,2,1,2]
Node 4 representing 'MANE' A=[4,3,2,2,1]

The smallest value for A[end] is 1 reached with the word 'MANE' so this is the best match.

Engleman answered 26/1, 2014 at 19:7 Comment(0)
V
1

There is a smart way to get every element that is not quite a Levenstein distance since the following algorithm does not incorporate transpositions.

Assuming we have the Tree structure, we can implement a recursive search of the tree. Your recursive search assumes we start with a cost-row representing the cost of deleting every letter. As we recursively search the tree, the information we have is

  • You are at node n, that has been indexed in your Trie structure by letter l.
  • You are considering a distance from a word w
  • Your current path assumes a previous cost-row up to this point, we wish to update this to form a new cost row for this node n.

We want to update our cost-row at the letter you are considering in accordance with 4 situations; l is the next letter in the word (cost row remains the same), the letter needs to be inserted (new cost +1), a letter has been deleted (cost of previous step +1), and the letter replaces a previous word (new cost +1).

The cost of proceeding down this path on your tree is the minimum of these costs. At this point, if your at a point in the Trie structure defining a word, append it to a list, and then recursively search all children for more words assuming the current cost is within a defined maximum cost. An implementation in Python can be found in another post:

https://mcmap.net/q/901667/-in-spell-checker-how-to-get-the-word-that-are-3-edits-away-norvig

I also have this in C for piping. Since the algorithm is pretty fast even for high edit distances (< len of word) one may use a fast efficient implementation of the Levenstein distance to correct this method.

Vickeyvicki answered 15/9, 2020 at 13:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.