I have a list of strings in Java containing first name of a person with dissimilar spellings (not entirely different). For example, John may be spelled as Jon, Jawn, Jaun etc. How should I retrieve the most appropriate string in this list. If anyone can suggest a method how to use Soundex in this case, it shall be of great help.
You have use approximate string matching algorithm , There are several strategies to implement this . Blur is a Trie-based Java implementation of approximate string matching based on the Levenshtein word distance.
There is another strategy to implement its called boyer-moore approximate string matching algorithm.
The usual approach to solve these problem using this algorithm and Levenshtein word distance is to compare the input to the possible outputs and choose the one with the smallest distance to the desired output.
There is one jar file for matching approximate string..
go through link and download frej.jar
http://sourceforge.net/projects/frej/files/
there is one method inside this jar file
Fuzzy.equals("jon","john");
it will return true in this type of approximate string.
Solr can do this, if you use the phonetic filter factory while indexing the text.
It is solr's speciality to search. And search for similar sounding words. However if you just want this, and not want other features offered by solr, then you can use the source available here.
There are lots of theories and methods to estimate the match of 2 strings
Giving a blunt true/false result seems strange since "jon" really doesn't equal "john", it's close but doesn't match
One great academic work implementing quite a few estimation methods is called "SecondString.jar" - site link
Most implemented methods give some score to the match, this score depends on the method used
Example: Lets define "Edit Distance" as the number of char changes required in str1 to get to str2 in this case "jon" --> "john" requires 1 char addition naturally for this method a lower score is better
This article provides a detailed explanation and complete code on a Trie-based Java implementation of approximate string matching: Fast and Easy Levenshtein distance using a Trie.
The search function returns a list of all words that are less than the given
maximum distance from the target word
def search( word, maxCost ):
# build first row
currentRow = range( len(word) + 1 )
results = []
# recursively search each branch of the trie
for letter in trie.children:
searchRecursive( trie.children[letter], letter, word, currentRow,
results, maxCost )
return results
This recursive helper is used by the search function above. It assumes that
the previousRow has been filled in already.
def searchRecursive( node, letter, word, previousRow, results, maxCost ):
columns = len( word ) + 1
currentRow = [ previousRow[0] + 1 ]
# Build one row for the letter, with a column for each letter in the target
# word, plus one for the empty string at column 0
for column in xrange( 1, columns ):
insertCost = currentRow[column - 1] + 1
deleteCost = previousRow[column] + 1
if word[column - 1] != letter:
replaceCost = previousRow[ column - 1 ] + 1
else:
replaceCost = previousRow[ column - 1 ]
currentRow.append( min( insertCost, deleteCost, replaceCost ) )
# if the last entry in the row indicates the optimal cost is less than the
# maximum cost, and there is a word in this trie node, then add it.
if currentRow[-1] <= maxCost and node.word != None:
results.append( (node.word, currentRow[-1] ) )
# if any entries in the row are less than the maximum cost, then
# recursively search each branch of the trie
if min( currentRow ) <= maxCost:
for letter in node.children:
searchRecursive( node.children[letter], letter, word, currentRow,
results, maxCost )
© 2022 - 2024 — McMap. All rights reserved.