In a spell checker, how to get words that are 3 edits away (Norvig)
I have been trying to build a spelling corrector for my database, following the reference at http://norvig.com/spell-correct.html. Using the Address_mast table as the collection of correct strings, I am trying to correct the addresses and write the corrected strings back to "customer_master".

Address_mast

ID        Address
1    sonal plaza,harley road,sw-309012
2    rose apartment,kell road, juniper, la-293889
3    plot 16, queen's tower, subbden - 399081
4    cognizant plaza, abs road, ziggar - 500234

The reference code only generates candidates that are two edits away from the word, but I am trying to go to three or even four edits, and at the same time update the corrected words in the other table. Here is the table that contains the misspelled words and is to be updated with the corrected ones:

Customer_master

Address_1

josely apartmt,kell road, juneeper, la-293889
zoonal plaza, harli road,sw-309012
plot 16, queen's tower, subbden - 399081
cognejantt pluza, abs road, triggar - 500234

Here is what I have tried:

import re
import pyodbc
import numpy as np
from collections import Counter

cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=localhost;DATABASE=DBM;UID=ADMIN;PWD=s@123;autocommit=True')
cursor = cnxn.cursor()
cursor.execute("select address as data  from Address_mast")
data=[]
for row in cursor.fetchall():
    data.append(row[0])

data = np.array(data)

def words(text): return re.findall(r'\w+', text.lower())

# Build the vocabulary from the fetched Address_mast strings,
# not from a file on disk named 'data'
WORDS = Counter(w for text in data for w in words(text))
def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or known(edits3(word)) or known(edits4(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def edits3(word): 
    "All edits that are three edits away from `word`."
    return (e3 for e2 in edits2(word) for e3 in edits1(e2))

def edits4(word): 
    "All edits that are four edits away from `word`."
    return (e4 for e3 in edits3(word) for e4 in edits1(e3))
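Rather than hand-writing edits3 and edits4, the same chaining can be generalised to any distance n. A minimal sketch (edits1 is repeated from above so the snippet stands alone); note the candidate set still explodes combinatorially, so this removes the duplication but not the cost:

```python
def edits1(word):
    "All edits that are one edit away from `word` (Norvig's version)."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts    = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits_n(word, n):
    "Yield every candidate reachable in up to n applications of edits1."
    frontier = {word}
    for _ in range(n):
        # Expand the frontier by one more edit, deduplicating as we go
        frontier = {e for w in frontier for e in edits1(w)}
        yield from frontier
```

With this, candidates(word) can fall back through known(edits_n(word, 1)), known(edits_n(word, 2)), and so on, instead of a separate function per distance.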


def correct_text(text):
    "Correct a whole address by correcting each alphabetic token."
    return re.sub(r'[a-z]+', lambda m: correction(m.group(0)), text.lower())

# The original loop passed the whole `data` array to correction() and
# built the SQL by string concatenation; correct word by word instead,
# and use a parameterised query to avoid quoting problems.
cursor.execute("select Address_1 from customer_master")
k = 0
for (old,) in cursor.fetchall():
    new = correct_text(old)
    cursor.execute("update customer_master set Address_1 = ? where Address_1 = ?",
                   new, old)
    k += cursor.rowcount
cnxn.commit()
cursor.close()
cnxn.close()
print(str(k) + " Records Completed")

From this I am unable to get the proper output; any suggestion on what changes should be made? Thanks in advance.

Neckerchief answered 5/6, 2017 at 11:36 Comment(4)
You can use the Fuzzy Lookup component's (of SSIS) API in C# or another programming language to find near matches using its built-in matching.Kucera
It seems you forgot to include your new edits3 and edits4 functions in candidates(). Or in what way is your output improper?Malatya
@RachelAmbler yes the customer_master table contains some address with some wrongly spelled words(because that column was derived from text of other Indian regional language to English language). So I'm trying to apply spell-corrector to rectify my wrongly translated text and replace them with the corrected one.. For which I'm taking address_mast data as my reference or training data which consist of similar or correct words.Neckerchief
My question still stands: What exactly is missing for "proper output"? You fixed the bug which kept the algorithm from producing variants at LD 3 and 4, so what is still wrong? Be very specific: What is produced, and how does it differ from what you want?Malatya
The answers above are OK, but there is a faster solution than checking the exponentially growing set of strings within edit distance k. Suppose we store the set of all words in a trie. This is useful because we know, for example, that we need not search paths along which no words lie: whole subtrees can be pruned. This is both memory-efficient and computationally efficient.

Suppose we have a vocabulary stored in a set, dict, or, ideally, a collections.Counter object; then we can set up the data structure as follows:

class VocabTreeNode:
    def __init__(self):
        self.children = {}
        self.word = None
        
    def build(self, vocab):
        for w in vocab:
            self.insert(w)

    def insert( self, word):
        node = self
        for letter in word:
            if letter not in node.children: 
                node.children[letter] = VocabTreeNode()
            node = node.children[letter]
        node.word = word

To search only the set of words within edit distance k of the query, we may endow this structure with a recursive search.

    def search(self, word, maxCost):
        currentRow = range( len(word) + 1 )    
        results = []
        for letter in self.children:
            self.searchRecursive(self.children[letter], letter, 
                                 word, currentRow, results, 
                                 maxCost)   
        return results
            
    def searchRecursive(self, node, letter, word, previousRow, 
                        results, maxCost):
        columns = len( word ) + 1
        currentRow = [ previousRow[0] + 1 ]
        for column in range( 1, columns ):
            insertCost = currentRow[column - 1] + 1
            deleteCost = previousRow[column] + 1
            if word[column - 1] != letter:
                replaceCost = previousRow[ column - 1 ] + 1
            else:                
                replaceCost = previousRow[ column - 1 ]
            currentRow.append( min( insertCost, deleteCost, replaceCost ) )
    
        if currentRow[-1] <= maxCost and node.word is not None:
            results.append( (node.word, currentRow[-1] ) )
        if min( currentRow ) <= maxCost:
            for next_letter in node.children:
                self.searchRecursive( node.children[next_letter], next_letter, word,
                                      currentRow, results, maxCost)

There is just one problem that I'm not sure how to overcome: transpositions are not valid as paths in the trie, so I'm not sure how to incorporate transpositions as edit distance 1 without a somewhat complicated hack.
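For reference, the usual way to charge an adjacent transposition as a single edit is the optimal-string-alignment variant of Damerau-Levenshtein, which needs the row from two steps back; this standalone sketch (separate from the trie code above, full matrix kept for clarity) shows the extra term:

```python
def osa_distance(a, b):
    """Levenshtein distance extended with adjacent transpositions
    (optimal string alignment)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Adjacent transposition: reaches back two rows, which is
            # why a search that carries only one previousRow cannot
            # express it directly.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

print(osa_distance('harley', 'harly'))   # one deletion
print(osa_distance('harley', 'hraley'))  # one transposition
```

Carrying two previous rows per trie path would let the same term be added to the recursive search, at the cost of the "somewhat complicated hack" mentioned above.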

My corpus was 97,722 words (the word list shipped with almost any Linux distro).

from time import sleep, time

sleep(1)
start = time()

for i in range(100):
    x = V.search('elephant', 3)

print(time() - start)

>>> 17.5 

That works out to about 0.175 seconds per edit-distance-3 search for this word. An edit-distance-4 search took about 0.377 seconds, whereas chaining edits1 repeatedly will quickly cause your system to run out of memory.

With the caveat of not easily handling transpositions, this is a fast, effective way of implementing a Norvig-type algorithm for high edit distances.
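To make this concrete, here is a condensed, self-contained version of the same trie search run against a toy vocabulary drawn from the question's addresses (the vocabulary and queries are illustrative only, not the 97,722-word corpus):

```python
class VocabTreeNode:
    def __init__(self):
        self.children = {}
        self.word = None

    def build(self, vocab):
        for w in vocab:
            node = self
            for letter in w:
                node = node.children.setdefault(letter, VocabTreeNode())
            node.word = w

    def search(self, word, maxCost):
        "Return [(word, distance)] for vocab words within maxCost edits."
        currentRow = list(range(len(word) + 1))  # first Levenshtein row
        results = []
        for letter, child in self.children.items():
            self._searchRecursive(child, letter, word, currentRow, results, maxCost)
        return results

    def _searchRecursive(self, node, letter, word, previousRow, results, maxCost):
        # Build one Levenshtein matrix row per trie level.
        currentRow = [previousRow[0] + 1]
        for column in range(1, len(word) + 1):
            insertCost = currentRow[column - 1] + 1
            deleteCost = previousRow[column] + 1
            replaceCost = previousRow[column - 1] + (word[column - 1] != letter)
            currentRow.append(min(insertCost, deleteCost, replaceCost))
        if currentRow[-1] <= maxCost and node.word is not None:
            results.append((node.word, currentRow[-1]))
        # Prune: descend only if some cell is still within budget.
        if min(currentRow) <= maxCost:
            for next_letter, child in node.children.items():
                self._searchRecursive(child, next_letter, word, currentRow,
                                      results, maxCost)

V = VocabTreeNode()
V.build(['plaza', 'apartment', 'road', 'tower', 'juniper'])
print(V.search('pluza', 1))     # -> [('plaza', 1)]
print(V.search('juneeper', 3))  # finds ('juniper', 2)
```

Feeding search results into a frequency-weighted max (as in Norvig's P) then picks the most probable candidate among those found.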

Ketchan answered 9/7, 2020 at 20:57 Comment(0)
