Fuzzy String Searching with Whoosh in Python
C

5

13

I've built up a large database of banks in MongoDB. I can easily take this information and create indexes from it in Whoosh. For example, I'd like to be able to match the bank names 'Eagle Bank & Trust Co of Missouri' and 'Eagle Bank and Trust Company of Missouri'. The following code works for simple fuzzy searches, but cannot achieve a match on the above:

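import os

from whoosh.index import create_in
from whoosh.fields import *

schema = Schema(name=TEXT(stored=True))
# create_in() needs an existing directory.
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = create_in("indexdir", schema)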
writer = ix.writer()

test_items = [u"Eagle Bank and Trust Company of Missouri"]

# Add each bank name to the index.
for item in test_items:
    writer.add_document(name=item)
writer.commit()

from whoosh.qparser import QueryParser
from whoosh.query import FuzzyTerm

with ix.searcher() as s:
    qp = QueryParser("name", schema=ix.schema, termclass=FuzzyTerm)
    q = qp.parse(u"Eagle Bank & Trust Co of Missouri")
    results = s.search(q)
    print(results)

gives me:

<Top 0 Results for And([FuzzyTerm('name', u'eagle', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'bank', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'trust', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'co', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'missouri', boost=1.000000, minsimilarity=0.500000, prefixlength=1)]) runtime=0.00166392326355>

Is it possible to achieve what I want with Whoosh? If not, what other Python-based solutions do I have?

Concepcion answered 15/7, 2011 at 15:55 Comment(0)
M
12

You could match Co with Company using fuzzy search in Whoosh, but you shouldn't, because the difference between Co and Company is large. Co is as similar to Company as Be is to Beast, or ny is to Company, so you can imagine how large and how poor the search results would be.

However, if you want to match Compan, compani or Companee to Company, you can do it with a custom subclass of FuzzyTerm whose default maxdist is 2 or more:

maxdist – The maximum edit distance from the given text.

class MyFuzzyTerm(FuzzyTerm):
    # Same signature as FuzzyTerm, but maxdist defaults to 2.
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

Then:

 qp = QueryParser("name", schema=ix.schema, termclass=MyFuzzyTerm)

You could match Co with Company by setting maxdist to 5, but as I said, this gives bad search results. I suggest keeping maxdist between 1 and 3.
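
To see why a maxdist of 5 is too loose, it helps to count the edits by hand. Below is a minimal sketch of a plain Levenshtein distance in pure Python. The helper name edit_distance is hypothetical and written only for illustration; it is not part of Whoosh, and Whoosh's own fuzzy matching may differ in detail (for instance, prefixlength additionally requires the leading characters to match exactly).

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


pairs = [("co", "company"), ("be", "beast"), ("compan", "company"),
         ("compani", "company"), ("companee", "company")]
for short, full in pairs:
    print(short, full, edit_distance(short, full))
```

Under this definition, co and company are 5 edits apart and be and beast are 3, while compan, compani and companee are all within 2 edits of company.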

If you want to match the linguistic variations of a word, you are better off using whoosh.query.Variations.

Note: older Whoosh versions have minsimilarity instead of maxdist.

Masterstroke answered 28/5, 2015 at 9:34 Comment(0)
T
3

For future reference: there must be a better way to do this, but here's my shot.

# -*- coding: utf-8 -*-
import whoosh
from whoosh.index import create_in
from whoosh.fields import *
from whoosh.query import *
from whoosh.qparser import QueryParser

schema = Schema(name=TEXT(stored=True))
idx = create_in("C:\\idx_name\\", schema, "idx_name")

writer = idx.writer()

writer.add_document(name=u"This is craaazy shit")
writer.add_document(name=u"This is craaazy beer")
writer.add_document(name=u"Raphaël rocks")
writer.add_document(name=u"Rockies are mountains")

writer.commit()

s = idx.searcher()
print("Fields: %s" % list(s.lexicon("name")))
qp = QueryParser("name", schema=schema, termclass=FuzzyTerm)

# Widen the allowed edit distance until something matches.
for i in range(1, 40):
    res = s.search(FuzzyTerm("name", u"just rocks", maxdist=i, prefixlength=0))
    if len(res) > 0:
        for r in res:
            print("Potential match ( %s ): [  %s  ]" % (i, r["name"]))
        break
    else:
        print("Pass: %s" % i)

s.close()
Titrate answered 20/10, 2011 at 13:23 Comment(0)
G
1

Perhaps some of this stuff might help (string matching open sourced by the seatgeek guys):

https://github.com/seatgeek/fuzzywuzzy
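
For the bank-name case in the question, one option along these lines is to normalize both names before comparing them. The sketch below uses only the standard library's difflib as a stand-in scorer; it is not the fuzzywuzzy API. The ABBREVIATIONS table and the normalize and similarity helpers are hypothetical names chosen for illustration, and the table would need extending for real data.

```python
import difflib
import re

# Hypothetical abbreviation table; extend it for your own data.
ABBREVIATIONS = {"&": "and", "co": "company", "corp": "corporation"}


def normalize(name):
    """Lowercase, split into tokens, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+|&", name.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def similarity(a, b):
    """Similarity in [0, 1] between the normalized forms of two names."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


query = "Eagle Bank & Trust Co of Missouri"
candidate = "Eagle Bank and Trust Company of Missouri"
print(normalize(query))
print(similarity(query, candidate))
```

After normalization the two names from the question become the same string, so no large edit distance is needed to pair them. The same normalization could be applied before indexing and before querying in Whoosh.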

Gunny answered 17/7, 2011 at 8:30 Comment(0)
P
0

For anyone stumbling across this question more recently: it looks like Whoosh's query parser now supports fuzzy terms natively (see FuzzyTermPlugin and the term~ syntax in the docs), though it would take a bit of work to satisfy the particular use case outlined here: https://whoosh.readthedocs.io/en/latest/parsing.html

Progression answered 21/6, 2022 at 18:43 Comment(2)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - Beni
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review - Precursory
P
-3

You could use the function below to check which words of a phrase occur in a text. Note that it is an exact substring test per word, not an edit-distance match:

def FuzzySearch(text, phrase):
    """Return the words of phrase that occur as substrings of text."""
    matches = []
    for word in phrase.split(" "):
        if word in text:
            print("Match! Found " + word + " in text")
            matches.append(word)
    return matches
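
If some tolerance for misspellings is wanted, a variant can compare words with the standard library's difflib instead of testing substrings. This is only a sketch under assumptions: fuzzy_word_matches and its cutoff parameter are hypothetical names, and the 0.8 threshold is an arbitrary starting point rather than a recommended value.

```python
import difflib


def fuzzy_word_matches(text, phrase, cutoff=0.8):
    """Map each word of phrase to its closest word in text, if any.

    cutoff is the minimum difflib similarity ratio (0..1) to accept.
    """
    text_words = text.lower().split()
    found = {}
    for word in phrase.lower().split():
        # Keep only the single best candidate above the cutoff.
        close = difflib.get_close_matches(word, text_words, n=1, cutoff=cutoff)
        if close:
            found[word] = close[0]
    return found


print(fuzzy_word_matches("Eagle Bank and Trust Company of Missouri",
                         "Eagle Compani Missourri Co"))
```

With this cutoff, near misses such as compani and missourri are paired with company and missouri, while the abbreviation co is too far from any word in the text to be matched.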
Pindus answered 29/2, 2016 at 17:55 Comment(0)
