Checking fuzzy/approximate substring existing in a longer string, in Python?
Asked Answered
K

6

79

Using algorithms like leveinstein ( leveinstein or difflib) , it is easy to find approximate matches.eg.

>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571

The fuzzy matches can be detected by deciding a threshold as needed.

Current requirement : To find fuzzy substring based on a threshold in a bigger string.

eg.

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
#result = "manhatan","manhattin" and their indexes in large_string

One brute force solution is to generate all substrings of length N-1 to N+1 ( or other matching length),where N is length of query_string, and use levenstein on them one by one and see the threshold.

Is there better solution available in python , preferably an included module in python 2.7 , or an externally available module .

---------------------UPDATE AND SOLUTION ----------------

Python regex module works pretty well, though it is little bit slower than inbuilt re module for fuzzy substring cases, which is an obvious outcome due to extra operations. The desired output is good and the control over magnitude of fuzziness can be easily defined.

>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b',input,flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>
Klement answered 19/7, 2013 at 7:51 Comment(2)
The regex solution does work for the given example. What problem are you having with it?Capp
I would name the variable as txt or something else, instead of input which is a python keyword :).Chickaree
E
28

The new regex library that's soon supposed to replace re includes fuzzy matching.

https://pypi.python.org/pypi/regex/

The fuzzy matching syntax looks fairly expressive, but this would give you a match with one or fewer insertions/additions/deletions.

import regex
regex.match('(amazing){e<=1}', 'amaging')
Escape answered 30/10, 2013 at 23:59 Comment(5)
FWIW, the fuzzy matching sadly might be removed for the version intended to be added to the standard library... if it ever actually gets in, that is.Capp
I couldn't get this to work with the OP's "manhattan" example -- can you show the code to make that work?Carbamidine
It's a shame that regex.match('(test){e<=1}', '123 test') doesn't match anythingComanche
@AwaisHussain - did you try regex.search('(test){e<=1}', '123 test')? I wouldn't expect that match call to return a hit.Verina
@EthanFurman did you try this? for m in regex.findall('(manhattan){e<2}', "thelargemanhatanproject is a great project in themanhattincity"): print(m). This gives manhatan and manhattin.Chickaree
D
23

I use fuzzywuzzy to fuzzy match based on threshold and fuzzysearch to fuzzy extract words from the match.

process.extractBests takes a query, list of words and a cutoff score and returns a list of tuples of match and score above the cutoff score.

find_near_matches takes the result of process.extractBests and returns the start and end indices of words. I use the indices to build the words and use the built word to find the index in the large string. max_l_dist of find_near_matches is 'Levenshtein distance' which has to be adjusted to suit the needs.

from fuzzysearch import find_near_matches
from fuzzywuzzy import process

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

def fuzzy_extract(qs, ls, threshold):
    '''fuzzy matches 'qs' in 'ls' and returns list of 
    tuples of (word,index)
    '''
    for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold):
        print('word {}'.format(word))
        for match in find_near_matches(qs, word, max_l_dist=1):
            match = word[match.start:match.end]
            print('match {}'.format(match))
            index = ls.find(match)
            yield (match, index)

To test:

query_string = "manhattan"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 70):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "citi"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "greet"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))

Output:

query: manhattan  
string: thelargemanhatanproject is a great project in themanhattincity  
match: manhatan  
index: 8  
match: manhattin  
index: 49  

query: citi  
string: thelargemanhatanproject is a great project in themanhattincity  
match: city  
index: 58  

query: greet  
string: thelargemanhatanproject is a great project in themanhattincity  
match: great  
index: 29 
Dasya answered 4/6, 2015 at 21:0 Comment(4)
index = ls.find(match) will only return the first occurrence.Vindicate
Excellent, but currently falls foul of github.com/taleinat/fuzzysearch/issues/13 - so you need to add: if len(ls) < len(qs): return; yield (following https://mcmap.net/q/261141/-how-to-write-python-generator-function-that-never-yields-anything).Clicker
Note: only works when trying to match one word with another word. Doesn't work for multi-word matches.Hibbitts
Note: only works when trying to match one word with another word. Doesn't work for multi-word matches. E.g. fail to see the close match with large_string = "thelargema nhatanproject is " and query_string = "thelarge manhatanproject".Hibbitts
P
23

The approaches above are good, but I needed to find a small needle in lots of hay, and ended up approaching it like this:

from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs

needle = "this is the string we want to find"
hay    = "text text lots of text and more and more this string is the one we wanted to find and here is some more and even more still"

needle_length  = len(needle.split())
max_sim_val    = 0
max_sim_string = u""

for ngram in ngrams(hay.split(), needle_length + int(.2*needle_length)):
    hay_ngram = u" ".join(ngram)
    similarity = SM(None, hay_ngram, needle).ratio() 
    if similarity > max_sim_val:
        max_sim_val = similarity
        max_sim_string = hay_ngram

print max_sim_val, max_sim_string

Yields:

0.72972972973 this string is the one we wanted to find
Pralltriller answered 15/7, 2015 at 14:35 Comment(4)
How do we get the starting index of the substring?Anthe
Not bulletproof, but you could simply use the find string method index = hay.find(max_sim_string)Bethanie
I tested with needle = "this is the string we want to find" and hay = "text text lots of text and this is the string we want to find more". Should give a score of 1 since it's a perfect match, but gives a score of 0.944 instead.Hibbitts
@FranckDernoncourt can you figure out why? Check the needle length vs your ngram length :)Pralltriller
O
21

How about using difflib.SequenceMatcher.get_matching_blocks?

>>> import difflib
>>> large_string = "thelargemanhatanproject"
>>> query_string = "manhattan"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.8888888888888888

>>> query_string = "banana"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.6666666666666666

UPDATE

import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            yield match

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
print list(matches(large_string, query_string, 0.8))

Above code print: ['manhatan', 'manhattn']

Orthopsychiatry answered 19/7, 2013 at 8:11 Comment(8)
How to retrieve the fuzzily matched substring from the blocks ? eg "manhatan"Klement
@DhruvPathak, a = "thelargemanhatanproject"; b = "manhattan"; s = difflib.SequenceMatcher(None, a, b); ''.join(a[i:i+n] for i, j, n in s.get_matching_blocks() if n)Orthopsychiatry
it does not extract "manhatan" from the large_string, it results in the query_string "manhattan" (double t)Klement
@DhruvPathak, ?? The code in my comment yield 'manhatan'. (single t)Orthopsychiatry
Can your code be extended to give multiple substrings too, as shown in the example edit in my quetsion ?Klement
@DhruvPathak, I added another code, if it is not exactly what you want.Orthopsychiatry
the code still prints incorrectly, the expected output is ['manhatan','manhattin'] where as your sample code prints ['manhatan', 'manhattn']Klement
Note: Doesn't work for multi-word matches, e.g. if large_string = "Rectangular Marquee tool and query_string = large_string, then no match is returned (I tested the codeblock added in revision 2).Hibbitts
S
17

Recently I've written an alignment library for Python: https://github.com/eseraygun/python-alignment

Using it, you can perform both global and local alignments with arbitrary scoring strategies on any pair of sequences. Actually, in your case, you need semi-local alignments as you don't care for the substrings of query_string. I've simulated semi-local algorithm using local alignment and some heuristics in the following code but it is easy to extend the library for a proper implementation.

Here is the example code in the README file modified for your case.

from alignment.sequence import Sequence, GAP_ELEMENT
from alignment.vocabulary import Vocabulary
from alignment.sequencealigner import SimpleScoring, LocalSequenceAligner

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

# Create sequences to be aligned.
a = Sequence(large_string)
b = Sequence(query_string)

# Create a vocabulary and encode the sequences.
v = Vocabulary()
aEncoded = v.encodeSequence(a)
bEncoded = v.encodeSequence(b)

# Create a scoring and align the sequences using local aligner.
scoring = SimpleScoring(1, -1)
aligner = LocalSequenceAligner(scoring, -1, minScore=5)
score, encodeds = aligner.align(aEncoded, bEncoded, backtrace=True)

# Iterate over optimal alignments and print them.
for encoded in encodeds:
    alignment = v.decodeSequenceAlignment(encoded)

    # Simulate a semi-local alignment.
    if len(filter(lambda e: e != GAP_ELEMENT, alignment.second)) != len(b):
        continue
    if alignment.first[0] == GAP_ELEMENT or alignment.first[-1] == GAP_ELEMENT:
        continue
    if alignment.second[0] == GAP_ELEMENT or alignment.second[-1] == GAP_ELEMENT:
        continue

    print alignment
    print 'Alignment score:', alignment.score
    print 'Percent identity:', alignment.percentIdentity()
    print

The output for minScore=5 is as follows.

m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t - i
m a n h a t t a n
Alignment score: 5
Percent identity: 77.7777777778

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

If you remove the minScore argument, you will get only the best scoring matches.

m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

Note that all algorithms in the library have O(n * m) time complexity, n and m being the lengths of the sequences.

Spacecraft answered 30/10, 2013 at 22:9 Comment(0)
V
4

I ran into this problem, and I found that neither of the top two answers worked. Instead, I used the following algorithm to detect the minimally wrong fuzzy match:

def fuzzy_substring_search(cls, major: str, minor: str, errs: int = 4) -> Optional[regex.Match]:
    """Find the closest matching fuzzy substring.

    Args:
        major: the string to search in
        minor: the string to search with
        errs: the total number of errors

    Returns:
        Optional[regex.Match] object
    """
    errs_ = 0
    s = regex.search(f"({minor}){{e<={errs_}}}", major)
    while s is None and errs_ <= errs:
        errs_ += 1
        s = regex.search(f"({minor}){{e<={errs_}}}", major)
    return s

This has the benefit of returning exact matches if they exist, and escalating fuzziness as needed.

Vermiculation answered 9/8, 2022 at 21:59 Comment(2)
Please do see the update in my question, I could find a module that worked wellKlement
@Klement I see your solution, and I don't doubt it works for your use case.Vermiculation

© 2022 - 2024 — McMap. All rights reserved.