Checking fuzzy/approximate substring existing in a longer string, in Python?

Asked 19/7, 2013 at 7:51 Answered 9/8, 2022 at 21:59

Using algorithms like leveinstein ( leveinstein or difflib) , it is easy to find approximate matches.eg.

>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571

The fuzzy matches can be detected by deciding a threshold as needed.

Current requirement : To find fuzzy substring based on a threshold in a bigger string.

eg.

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
#result = "manhatan","manhattin" and their indexes in large_string

One brute force solution is to generate all substrings of length N-1 to N+1 ( or other matching length),where N is length of query_string, and use levenstein on them one by one and see the threshold.

Is there better solution available in python , preferably an included module in python 2.7 , or an externally available module .

---------------------UPDATE AND SOLUTION ----------------

Python regex module works pretty well, though it is little bit slower than inbuilt re module for fuzzy substring cases, which is an obvious outcome due to extra operations. The desired output is good and the control over magnitude of fuzziness can be easily defined.

>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b',input,flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>

Klement answered 19/7, 2013 at 7:51 Comment(2)

The regex solution does work for the given example. What problem are you having with it? – Capp 31/5, 2015 at 11:22

I would name the variable as txt or something else, instead of input which is a python keyword :). – Chickaree 7/12, 2023 at 1:30

The new regex library that's soon supposed to replace re includes fuzzy matching.

https://pypi.python.org/pypi/regex/

The fuzzy matching syntax looks fairly expressive, but this would give you a match with one or fewer insertions/additions/deletions.

import regex
regex.match('(amazing){e<=1}', 'amaging')

Escape answered 30/10, 2013 at 23:59 Comment(5)

FWIW, the fuzzy matching sadly might be removed for the version intended to be added to the standard library... if it ever actually gets in, that is. – Capp 31/5, 2015 at 11:26

I couldn't get this to work with the OP's "manhattan" example -- can you show the code to make that work? – Carbamidine 1/6, 2015 at 17:25

It's a shame that regex.match('(test){e<=1}', '123 test') doesn't match anything – Comanche 25/11, 2017 at 17:44

@AwaisHussain - did you try regex.search('(test){e<=1}', '123 test')? I wouldn't expect that match call to return a hit. – Verina 18/9, 2018 at 19:14

@EthanFurman did you try this? for m in regex.findall('(manhattan){e<2}', "thelargemanhatanproject is a great project in themanhattincity"): print(m). This gives manhatan and manhattin. – Chickaree 7/12, 2023 at 1:36

I use fuzzywuzzy to fuzzy match based on threshold and fuzzysearch to fuzzy extract words from the match.

process.extractBests takes a query, list of words and a cutoff score and returns a list of tuples of match and score above the cutoff score.

find_near_matches takes the result of process.extractBests and returns the start and end indices of words. I use the indices to build the words and use the built word to find the index in the large string. max_l_dist of find_near_matches is 'Levenshtein distance' which has to be adjusted to suit the needs.

from fuzzysearch import find_near_matches
from fuzzywuzzy import process

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

def fuzzy_extract(qs, ls, threshold):
    '''fuzzy matches 'qs' in 'ls' and returns list of 
    tuples of (word,index)
    '''
    for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold):
        print('word {}'.format(word))
        for match in find_near_matches(qs, word, max_l_dist=1):
            match = word[match.start:match.end]
            print('match {}'.format(match))
            index = ls.find(match)
            yield (match, index)

To test:

query_string = "manhattan"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 70):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "citi"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))

query_string = "greet"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
    print('match: {}\nindex: {}'.format(match, index))

Output:

query: manhattan  
string: thelargemanhatanproject is a great project in themanhattincity  
match: manhatan  
index: 8  
match: manhattin  
index: 49  

query: citi  
string: thelargemanhatanproject is a great project in themanhattincity  
match: city  
index: 58  

query: greet  
string: thelargemanhatanproject is a great project in themanhattincity  
match: great  
index: 29

Dasya answered 4/6, 2015 at 21:0 Comment(4)

index = ls.find(match) will only return the first occurrence. – Vindicate 26/5, 2020 at 14:32

Excellent, but currently falls foul of github.com/taleinat/fuzzysearch/issues/13 - so you need to add: if len(ls) < len(qs): return; yield (following https://mcmap.net/q/261141/-how-to-write-python-generator-function-that-never-yields-anything). – Clicker 25/6, 2020 at 13:1

Note: only works when trying to match one word with another word. Doesn't work for multi-word matches. – Hibbitts 9/5 at 14:41

Note: only works when trying to match one word with another word. Doesn't work for multi-word matches. E.g. fail to see the close match with large_string = "thelargema nhatanproject is " and query_string = "thelarge manhatanproject". – Hibbitts 9/5 at 17:22

The approaches above are good, but I needed to find a small needle in lots of hay, and ended up approaching it like this:

from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs

needle = "this is the string we want to find"
hay    = "text text lots of text and more and more this string is the one we wanted to find and here is some more and even more still"

needle_length  = len(needle.split())
max_sim_val    = 0
max_sim_string = u""

for ngram in ngrams(hay.split(), needle_length + int(.2*needle_length)):
    hay_ngram = u" ".join(ngram)
    similarity = SM(None, hay_ngram, needle).ratio() 
    if similarity > max_sim_val:
        max_sim_val = similarity
        max_sim_string = hay_ngram

print max_sim_val, max_sim_string

Yields:

0.72972972973 this string is the one we wanted to find

Pralltriller answered 15/7, 2015 at 14:35 Comment(4)

How do we get the starting index of the substring? – Anthe 24/3, 2022 at 17:2

Not bulletproof, but you could simply use the find string method index = hay.find(max_sim_string) – Bethanie 6/10, 2023 at 19:46

I tested with needle = "this is the string we want to find" and hay = "text text lots of text and this is the string we want to find more". Should give a score of 1 since it's a perfect match, but gives a score of 0.944 instead. – Hibbitts 9/5 at 14:47

@FranckDernoncourt can you figure out why? Check the needle length vs your ngram length :) – Pralltriller 9/5 at 23:1

How about using difflib.SequenceMatcher.get_matching_blocks?

>>> import difflib
>>> large_string = "thelargemanhatanproject"
>>> query_string = "manhattan"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.8888888888888888

>>> query_string = "banana"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.6666666666666666

UPDATE

import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            yield match

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
print list(matches(large_string, query_string, 0.8))

Above code print: ['manhatan', 'manhattn']

Orthopsychiatry answered 19/7, 2013 at 8:11 Comment(8)

How to retrieve the fuzzily matched substring from the blocks ? eg "manhatan" – Klement 19/7, 2013 at 9:27

@DhruvPathak,

a = "thelargemanhatanproject"; b = "manhattan"; s = difflib.SequenceMatcher(None, a, b); ''.join(a[i:i+n] for i, j, n in s.get_matching_blocks() if n)

– Orthopsychiatry 19/7, 2013 at 9:36

it does not extract "manhatan" from the large_string, it results in the query_string "manhattan" (double t) – Klement 19/7, 2013 at 13:45

@DhruvPathak, ?? The code in my comment yield 'manhatan'. (single t) – Orthopsychiatry 19/7, 2013 at 13:50

Can your code be extended to give multiple substrings too, as shown in the example edit in my quetsion ? – Klement 19/7, 2013 at 18:26

@DhruvPathak, I added another code, if it is not exactly what you want. – Orthopsychiatry 20/7, 2013 at 3:21

the code still prints incorrectly, the expected output is ['manhatan','manhattin'] where as your sample code prints ['manhatan', 'manhattn'] – Klement 24/10, 2013 at 8:33

Note: Doesn't work for multi-word matches, e.g. if large_string = "Rectangular Marquee tool and query_string = large_string, then no match is returned (I tested the codeblock added in revision 2). – Hibbitts 9/5 at 14:56

Recently I've written an alignment library for Python: https://github.com/eseraygun/python-alignment

Using it, you can perform both global and local alignments with arbitrary scoring strategies on any pair of sequences. Actually, in your case, you need semi-local alignments as you don't care for the substrings of query_string. I've simulated semi-local algorithm using local alignment and some heuristics in the following code but it is easy to extend the library for a proper implementation.

Here is the example code in the README file modified for your case.

from alignment.sequence import Sequence, GAP_ELEMENT
from alignment.vocabulary import Vocabulary
from alignment.sequencealigner import SimpleScoring, LocalSequenceAligner

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

# Create sequences to be aligned.
a = Sequence(large_string)
b = Sequence(query_string)

# Create a vocabulary and encode the sequences.
v = Vocabulary()
aEncoded = v.encodeSequence(a)
bEncoded = v.encodeSequence(b)

# Create a scoring and align the sequences using local aligner.
scoring = SimpleScoring(1, -1)
aligner = LocalSequenceAligner(scoring, -1, minScore=5)
score, encodeds = aligner.align(aEncoded, bEncoded, backtrace=True)

# Iterate over optimal alignments and print them.
for encoded in encodeds:
    alignment = v.decodeSequenceAlignment(encoded)

    # Simulate a semi-local alignment.
    if len(filter(lambda e: e != GAP_ELEMENT, alignment.second)) != len(b):
        continue
    if alignment.first[0] == GAP_ELEMENT or alignment.first[-1] == GAP_ELEMENT:
        continue
    if alignment.second[0] == GAP_ELEMENT or alignment.second[-1] == GAP_ELEMENT:
        continue

    print alignment
    print 'Alignment score:', alignment.score
    print 'Percent identity:', alignment.percentIdentity()
    print

The output for minScore=5 is as follows.

m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t - i
m a n h a t t a n
Alignment score: 5
Percent identity: 77.7777777778

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

If you remove the minScore argument, you will get only the best scoring matches.

m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

Note that all algorithms in the library have O(n * m) time complexity, n and m being the lengths of the sequences.

Spacecraft answered 30/10, 2013 at 22:9 Comment(0)

I ran into this problem, and I found that neither of the top two answers worked. Instead, I used the following algorithm to detect the minimally wrong fuzzy match:

def fuzzy_substring_search(cls, major: str, minor: str, errs: int = 4) -> Optional[regex.Match]:
    """Find the closest matching fuzzy substring.

    Args:
        major: the string to search in
        minor: the string to search with
        errs: the total number of errors

    Returns:
        Optional[regex.Match] object
    """
    errs_ = 0
    s = regex.search(f"({minor}){{e<={errs_}}}", major)
    while s is None and errs_ <= errs:
        errs_ += 1
        s = regex.search(f"({minor}){{e<={errs_}}}", major)
    return s

This has the benefit of returning exact matches if they exist, and escalating fuzziness as needed.

Vermiculation answered 9/8, 2022 at 21:59 Comment(2)

Please do see the update in my question, I could find a module that worked well – Klement 30/8, 2022 at 10:48

@Klement I see your solution, and I don't doubt it works for your use case. – Vermiculation 2/9, 2022 at 12:53

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags