Java: how to find the most probable string in a list of strings?
Asked Answered
S

5

6

I have a list of strings in Java containing first name of a person with dissimilar spellings (not entirely different). For example, John may be spelled as Jon, Jawn, Jaun etc. How should I retrieve the most appropriate string in this list. If anyone can suggest a method how to use Soundex in this case, it shall be of great help.

Sacerdotal answered 29/9, 2012 at 6:8 Comment(0)
T
4

You have use approximate string matching algorithm , There are several strategies to implement this . Blur is a Trie-based Java implementation of approximate string matching based on the Levenshtein word distance.

There is another strategy to implement its called boyer-moore approximate string matching algorithm.

The usual approach to solve these problem using this algorithm and Levenshtein word distance is to compare the input to the possible outputs and choose the one with the smallest distance to the desired output.

Tango answered 29/9, 2012 at 6:17 Comment(0)
U
4

There is one jar file for matching approximate string..

go through link and download frej.jar

http://sourceforge.net/projects/frej/files/

there is one method inside this jar file

Fuzzy.equals("jon","john");

it will return true in this type of approximate string.

Unreserve answered 29/9, 2012 at 6:42 Comment(0)
G
1

Solr can do this, if you use the phonetic filter factory while indexing the text.

It is solr's speciality to search. And search for similar sounding words. However if you just want this, and not want other features offered by solr, then you can use the source available here.

Globuliferous answered 29/9, 2012 at 6:35 Comment(0)
G
1

There are lots of theories and methods to estimate the match of 2 strings

Giving a blunt true/false result seems strange since "jon" really doesn't equal "john", it's close but doesn't match

One great academic work implementing quite a few estimation methods is called "SecondString.jar" - site link

Most implemented methods give some score to the match, this score depends on the method used

Example: Lets define "Edit Distance" as the number of char changes required in str1 to get to str2 in this case "jon" --> "john" requires 1 char addition naturally for this method a lower score is better

Gregoriagregorian answered 27/3, 2014 at 15:47 Comment(0)
S
1

This article provides a detailed explanation and complete code on a Trie-based Java implementation of approximate string matching: Fast and Easy Levenshtein distance using a Trie.

The search function returns a list of all words that are less than the given

maximum distance from the target word

def search( word, maxCost ):

# build first row
currentRow = range( len(word) + 1 )

results = []

# recursively search each branch of the trie
for letter in trie.children:
    searchRecursive( trie.children[letter], letter, word, currentRow, 
        results, maxCost )

return results

This recursive helper is used by the search function above. It assumes that

the previousRow has been filled in already.

def searchRecursive( node, letter, word, previousRow, results, maxCost ):

columns = len( word ) + 1
currentRow = [ previousRow[0] + 1 ]

# Build one row for the letter, with a column for each letter in the target
# word, plus one for the empty string at column 0
for column in xrange( 1, columns ):

    insertCost = currentRow[column - 1] + 1
    deleteCost = previousRow[column] + 1

    if word[column - 1] != letter:
        replaceCost = previousRow[ column - 1 ] + 1
    else:                
        replaceCost = previousRow[ column - 1 ]

    currentRow.append( min( insertCost, deleteCost, replaceCost ) )

# if the last entry in the row indicates the optimal cost is less than the
# maximum cost, and there is a word in this trie node, then add it.
if currentRow[-1] <= maxCost and node.word != None:
    results.append( (node.word, currentRow[-1] ) )

# if any entries in the row are less than the maximum cost, then 
# recursively search each branch of the trie
if min( currentRow ) <= maxCost:
    for letter in node.children:
        searchRecursive( node.children[letter], letter, word, currentRow, 
            results, maxCost )
Supernatant answered 12/7, 2014 at 19:28 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.