Find best substring match

I'm looking for a library, or a method using existing libraries (difflib, fuzzywuzzy, python-Levenshtein), to find the closest match of a string (query) in a text (corpus).

I've developed a method based on difflib, where I split my corpus into ngrams of size n (the length of the query).

import difflib
from nltk.util import ngrams

def get_best_match(query, corpus):
    ngs = ngrams(list(corpus), len(query))
    ngrams_text = [''.join(x) for x in ngs]
    return difflib.get_close_matches(query, ngrams_text, n=1, cutoff=0)

It works as I want when the differences between the query and the matched string are just character replacements:

query = "ipsum dolor"
corpus = "lorem 1psum d0l0r sit amet"

match = get_best_match(query, corpus)
# match = "1psum d0l0r"

But when the difference involves character deletions, it does not:

query = "ipsum dolor"
corpus = "lorem 1psum dlr sit amet"

match = get_best_match(query, corpus)
# match = "psum dlr si"
# expected_match = "1psum dlr"

Is there a way to get a more flexible result size (as for expected_match)?

EDIT 1:

  • The actual use of this script is to match queries (strings) against messy OCR output.
  • As I said in the question, the OCR can confuse characters, and even miss them.
  • If possible, consider also the case where a space is missing between words.
  • The best match is one that does not include characters from words other than those in the query.

EDIT 2:

The solution I use now is to extend the ngrams with (n-k)-grams for k = {1, 2, 3}, to account for up to 3 deletions. It's much better than the first version, but not efficient in terms of speed, as we have more than 3 times as many ngrams to check. It is also not a generalizable solution; a sketch of it is shown below.
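
For reference, a minimal sketch of what that workaround could look like, building on the difflib/nltk code above (the function name and the max_deletions parameter are illustrative, not part of the original script):

import difflib
from nltk.util import ngrams

def get_best_match_with_deletions(query, corpus, max_deletions=3):
    # Build n-grams and (n-k)-grams for k = 1..max_deletions, so candidates
    # shorter than the query (caused by deleted characters) are also considered.
    candidates = []
    for k in range(max_deletions + 1):
        n = len(query) - k
        if n < 1:
            break
        candidates.extend(''.join(g) for g in ngrams(list(corpus), n))
    return difflib.get_close_matches(query, candidates, n=1, cutoff=0)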

Endosteum answered 15/3, 2016 at 13:54 Comment(3)
ngrams will struggle with the double delete "dlr" in your sample; Levenshtein may give better results?Protege
It seems a Levenshtein automaton might be the data structure you're looking for, but, unfortunately, I'm not aware of any existing Python implementations. Update: here's another blog post with some Python code for it; not a production-ready library, but it still might be a good start.Cowes
If I've correctly understood the way the automaton works, it considers the corpus as a list of separate words? In that case it won't solve OP's problem, because he is matching queries against OCR output (text parsed from image recognition, I suppose), which cannot be expected to separate words correctly. Really cool algorithm though, it's going in my book!Coir

This function finds the best matching substring of variable length.

The implementation considers the corpus as one long string, hence avoiding your concerns with spaces and unseparated words.

Code summary:

  1. Scan the corpus for match values in steps of size step to find the approximate location of the highest match value, pos.
  2. Find the substring in the vicinity of pos with the highest match value, by adjusting the left/right positions of the substring.

from difflib import SequenceMatcher

def get_best_match(query, corpus, step=4, flex=3, case_sensitive=False, verbose=False):
    """Return best matching substring of corpus.

    Parameters
    ----------
    query : str
    corpus : str
    step : int
        Step size of first match-value scan through corpus. Can be thought of
        as a sort of "scan resolution". Should not exceed length of query.
    flex : int
        Max. left/right substring position adjustment value. Should not
        exceed length of query / 2.

    Outputs
    -------
    output0 : str
        Best matching substring.
    output1 : float
        Match ratio of best matching substring. 1 is perfect match.
    """

    def _match(a, b):
        """Compact alias for SequenceMatcher."""
        return SequenceMatcher(None, a, b).ratio()

    def scan_corpus(step):
        """Return list of match values from corpus-wide scan."""
        match_values = []

        m = 0
        while m + qlen - step <= len(corpus):
            match_values.append(_match(query, corpus[m : m-1+qlen]))
            if verbose:
                print(query, "-", corpus[m: m + qlen], _match(query, corpus[m: m + qlen]))
            m += step

        return match_values

    def index_max(v):
        """Return index of max value."""
        return max(range(len(v)), key=v.__getitem__)

    def adjust_left_right_positions():
        """Return left/right positions for best string match."""
        # bp_l / bp_r stand for 'best position left/right'; they are adjusted
        # to optimize bmv_l / bmv_r
        p_l, bp_l = [pos] * 2
        p_r, bp_r = [pos + qlen] * 2

        # bmv_* are declared here in case they are untouched in optimization
        bmv_l = match_values[p_l // step]
        bmv_r = match_values[p_l // step]

        for f in range(flex):
            ll = _match(query, corpus[p_l - f: p_r])
            if ll > bmv_l:
                bmv_l = ll
                bp_l = p_l - f

            lr = _match(query, corpus[p_l + f: p_r])
            if lr > bmv_l:
                bmv_l = lr
                bp_l = p_l + f

            rl = _match(query, corpus[p_l: p_r - f])
            if rl > bmv_r:
                bmv_r = rl
                bp_r = p_r - f

            rr = _match(query, corpus[p_l: p_r + f])
            if rr > bmv_r:
                bmv_r = rr
                bp_r = p_r + f

            if verbose:
                print("\n" + str(f))
                print("ll: -- value: %f -- snippet: %s" % (ll, corpus[p_l - f: p_r]))
                print("lr: -- value: %f -- snippet: %s" % (lr, corpus[p_l + f: p_r]))
                print("rl: -- value: %f -- snippet: %s" % (rl, corpus[p_l: p_r - f]))
                print("rr: -- value: %f -- snippet: %s" % (rl, corpus[p_l: p_r + f]))

        return bp_l, bp_r, _match(query, corpus[bp_l : bp_r])

    if not case_sensitive:
        query = query.lower()
        corpus = corpus.lower()

    qlen = len(query)

    if flex >= qlen/2:
        print("Warning: flex exceeds length of query / 2. Setting to default.")
        flex = 3

    match_values = scan_corpus(step)
    pos = index_max(match_values) * step

    pos_left, pos_right, match_value = adjust_left_right_positions()

    return corpus[pos_left: pos_right].strip(), match_value

Example:

query = "ipsum dolor"
corpus = "lorem i psum d0l0r sit amet"
match = get_best_match(query, corpus, step=2, flex=4)
print(match)
('i psum d0l0r', 0.782608695652174)

Some good heuristic advice is to always keep step < len(query) * 3/4, and flex < len(query) / 3. I also added case sensitivity, in case that's important. It works quite well when you start playing with the step and flex values. Small step values give better results but take longer to compute. flex governs how flexible the length of the resulting substring is allowed to be.

Important to note: This will only find the first best match, so if there are multiple equally good matches, only the first will be returned. To allow for multiple matches, change index_max() to return a list of indices for the n highest values of the input list, and loop over adjust_left_right_positions() for values in that list.
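
A possible sketch of such a replacement for index_max(), using the standard-library heapq module (the name index_top_n is made up here):

import heapq

def index_top_n(v, n):
    """Return indices of the n highest values in v, best first."""
    return heapq.nlargest(n, range(len(v)), key=v.__getitem__)

Each returned index would then be multiplied by step to get a candidate pos, and adjust_left_right_positions() run once per candidate.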

Coir answered 21/3, 2016 at 13:30 Comment(8)
Thank you for your answer. It's a well-done function, as it has parameters. I've made tests to compare it with my method. Your function often keeps a space at the end of the matched word, but it's something I can easily fix. I think that I can combine both methods to improve the results. +1 +bountyEndosteum
Thanks, glad I could help. Spaces can be kept at the end in cases where removing them wouldn't change the edit distance, since a substitution and an insertion cost the same. I edited the answer to call the .strip() function on the output to correct for this.Coir
Does @UlfAslak's answer assume that corpus has to be longer than query?Skeg
@RhoPhi no, it will work for a query that's longer than the corpus. Not sure why you would do that, but if you think of using this in your code and you're afraid it will raise an error, it won't. It will just return the corpus.Coir
Is there a Python 3 version available?Boote
@Iddoweiner probably not too difficult to change it. You're welcome to suggest and edit!Coir
@Iddoweiner just change all print statements to print( function calls and replace xrange with rangeTorytoryism
In scan_corpus(), are we calling _match(query, corpus[m : m-1+qlen]) like in the code? Or should we use _match(query, corpus[m: m + qlen]) as is printed out in the verbose block? (note m-1 vs. m)Sling

The main path to a solution uses finite state automata (FSA) of some kind. If you want a detailed summary of the topic, check this dissertation out (PDF link). Error-based models (including Levenshtein automata and transducers, the former of which Sergei mentioned) are valid approaches to this. However, stochastic models, including various types of machine learning approaches integrated with FSAs, are very popular at the moment.

Since we are looking at edit distances (effectively misspelled words), the Levenshtein approach is good and relatively simple. This paper (as well as the dissertation; also a PDF) gives a decent outline of the basic idea, and it also explicitly mentions the application to OCR tasks. However, I will review some of the key points below.

The basic idea is that you want to build an FSA that accepts both the valid string and all strings up to some error distance (k). In the general case, this k could be infinite or the size of the text, but this is mostly irrelevant for OCR (if your OCR could even potentially return bl*h where * is the rest of the entire text, I would advise finding a better OCR system). Hence, we can exclude regexes like bl*h from the set of valid answers for the search string blah. A general, simple and intuitive k for your context is probably the length of the string (w) minus 2. This allows b--h to be a valid string for blah. It also allows bla--h, but that's okay. Also, keep in mind that the errors can be any character you specify, including spaces (hence 'multiword' input is solvable).

The next basic task is to set up a simple weighted transducer. Any of the OpenFST Python ports can do this (here's one). The logic is simple: insertions and deletions increment the weight, while equality increments the index in the input string. You could also just hand-code it, as the author of the blog post linked in Sergei's comment did.
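
Purely to illustrate that edit-distance logic in plain Python (this does not use OpenFST; the function name is made up), here is a sketch of the classic Sellers dynamic program, which finds the cost and end position of the corpus substring with minimum edit distance to the query:

def best_approx_match(query, text):
    # Sellers' variant of edit distance: a match may start anywhere in the
    # text for free (curr[0] = 0), so the best value reached at the full
    # query length marks the cheapest matching substring.
    # Runs in O(len(text) * len(query)).
    w = len(query)
    prev = list(range(w + 1))  # DP column for the empty text prefix
    best_cost, best_end = w, 0
    for j, ch in enumerate(text, start=1):
        curr = [0]  # a match may start at position j for free
        for i in range(1, w + 1):
            curr.append(min(
                prev[i] + 1,                         # skip a text character
                curr[i - 1] + 1,                     # skip a query character
                prev[i - 1] + (query[i - 1] != ch),  # substitution / match
            ))
        if curr[w] < best_cost:
            best_cost, best_end = curr[w], j
        prev = curr
    return best_cost, best_end

For the question's second example, best_approx_match("ipsum dolor", "lorem 1psum dlr sit amet") reports a minimum distance of 3 at the end of "1psum dlr"; recovering the start of the match takes a little extra bookkeeping or a backtracking pass.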

Once you have the weights and associated indexes of the weights, you just sort and return. The computational complexity should be O(n(w+k)), since we will look ahead w+k characters in the worst case for each character (n) in the text.

From here, you can do all sorts of things. You could convert the transducer to a DFA. You could parallelize the system by breaking the text into w+k-grams, which are sent to different processes. You could develop a language model or confusion matrix that defines what common mistakes exist for each letter in the input set (and thereby restrict the space of valid transitions and the complexity of the associated FSA). The literature is vast and still growing so there are probably as many modifications as there are solutions (if not more).

Hopefully that answers some of your questions without giving any code.

Cherish answered 23/3, 2016 at 6:47 Comment(0)

I would try to build a regular expression template from the query string. The template could then be used to search the corpus for substrings that are likely to match the query. Then use difflib or fuzzywuzzy to check if the substring does match the query.

For example, a possible template would be to match at least one of the first two letters of the query, at least one of the last two letters of the query, and have approximately the right number of letters in between:

import difflib
import re

query = "ipsum dolor"
corpus = ["lorem 1psum d0l0r sit amet",
          "lorem 1psum dlr sit amet",
          "lorem ixxxxxxxr sit amet"]

first_letter, second_letter = query[:2]
minimum_gap, maximum_gap = len(query) - 6, len(query) - 3
penultimate_letter, ultimate_letter = query[-2:]

fmt = '(?:{}.|.{}).{{{},{}}}(?:{}.|.{})'.format
pattern = fmt(first_letter, second_letter,
              minimum_gap, maximum_gap,
              penultimate_letter, ultimate_letter)

#print(pattern) # for debugging pattern

m = difflib.SequenceMatcher(None, "", query, False)

for c in corpus:
    for match in re.finditer(pattern, c, re.IGNORECASE):
        substring = match.group()
        m.set_seq1(substring)
        ops = m.get_opcodes()

        # EDIT fixed calculation of the number of edits
        #num_edits = sum(1 for t,_,_,_,_ in ops if t != 'equal')
        num_edits = sum(max(i2-i1, j2-j1) for op,i1,i2,j1,j2 in ops if op != 'equal' )
        print(num_edits, substring)

Output:

3 1psum d0l0r
3 1psum dlr
9 ixxxxxxxr

Another idea is to use the characteristics of the OCR when building the regex. For example, if the OCR always gets certain letters correct, then when any of those letters are in the query, use a few of them in the regex. Or if the OCR mixes up '1', '!', 'l', and 'i', but never substitutes something else, then if one of those letters is in the query, use [1!il] in the regex.
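
As a hypothetical illustration of that idea (the confusion map and the query_to_pattern helper are made up here), each query character could be replaced by its confusion class when building the pattern:

import re

# Hypothetical OCR confusion classes: characters the OCR is known to mix up.
CONFUSION = {c: '[1!il]' for c in '1!il'}
CONFUSION.update({c: '[0oO]' for c in '0oO'})

def query_to_pattern(query):
    # Substitute each character with its confusion class, escaping the rest.
    return ''.join(CONFUSION.get(c, re.escape(c)) for c in query)

pattern = query_to_pattern("ipsum dolor")  # '[1!il]psum d[0oO][1!il][0oO]r'
print(re.findall(pattern, "lorem 1psum d0l0r sit amet"))
# ['1psum d0l0r']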

Junko answered 21/3, 2016 at 14:11 Comment(2)
I don't think regex templates are a good solution for this kind of problem. Your solution outputs 'ixxxxxxxr' as a good match with num_edits = 1.Endosteum
@Ghilas Thanks for catching the error. The regex is simply used to find places where the query might match. Then difflib.SequenceMatcher is used to do the comparison. I was mis-calculating the edit distance. Fixed.Junko

Using Levenshtein distance is a good way to find the best substring match in a corpus, and it is also much faster than difflib's ratio function in most cases.

This code requires python-Levenshtein

pip install levenshtein

The code below finds the best substring match in two stages.

The first stage constructs ngrams of the same size as the query using a sliding window over the corpus (with the sliding window's step = len(query)//2). The ngram with the minimum Levenshtein distance from the query is found. This ngram marks the region of the corpus around which the best match exists.

The second stage performs a more thorough search over the region identified (narrowed_corpus). This is done by constructing ngrams of varying lengths over narrowed_corpus with step=1. The best substring match is the ngram with the minimum Levenshtein distance from the query. The lengths of the ngrams constructed depend on the argument step_factor. The lengths considered are

(len(query)//2 - 1), (len(query)//2 - 1 + step), (len(query)//2 - 1 + 2*step) ... (2 * len(query) + 2)

Here step = len(query)//step_factor. Increasing the step factor increases the number of ngram lengths considered, hence increasing the chances of finding the optimal substring with minimum Levenshtein distance. However, increasing it also leads to increased runtime (because more ngrams need to be checked) and memory usage (because more ngrams need to be stored).

from Levenshtein import distance as ld
from math import inf

def get_best_match(query, corpus, case_sensitive=False, step_factor=128, favour_smallest=False):
    '''
    Returns the substring of the corpus with the least Levenshtein distance from the query
    (may not always return the optimal answer).

    Arguments
    - query: str
    - corpus: str
    - case_sensitive: bool
    - step_factor: int
        Influences the resolution of the thorough search once the general region is found.
        The increment in ngram lengths used for the thorough search is calculated as len(query)//step_factor.
        Increasing this increases the number of ngram lengths used in the thorough search and increases the chances
        of getting the optimal solution at the cost of runtime and memory.
    - favour_smallest: bool
        Once the region of the best match is found, the search proceeds from larger to smaller ngrams or vice versa.
        If two or more ngrams have the same minimum distance then this flag controls whether the largest or smallest
        is returned.

    Returns
    {
        'best_match': best matching substring of the corpus,
        'min_ld': Levenshtein distance of the closest match
    }
    '''

    if not case_sensitive:
        query = query.casefold()
        corpus = corpus.casefold()

    corpus_len = len(corpus)
    query_len = len(query)
    query_len_by_2 = max(query_len // 2, 1)
    query_len_by_step_factor = max(query_len // step_factor, 1)

    closest_match_idx = 0
    min_dist = inf
    # Initial search of the corpus checks ngrams of the same length as the query.
    # Step is half the length of the query.
    # This is found to be good enough to find the general region of the best match in the corpus
    corpus_ngrams = [corpus[i:i+query_len] for i in range(0, corpus_len-query_len+1, query_len_by_2)]
    for idx, ngram in enumerate(corpus_ngrams):
        ngram_dist = ld(ngram, query)
        if ngram_dist < min_dist:
            min_dist = ngram_dist
            closest_match_idx = idx

    closest_match_idx = closest_match_idx * query_len_by_2
    closest_match = corpus[closest_match_idx: closest_match_idx + query_len]
    left = max(closest_match_idx - query_len_by_2 - 1, 0)
    right = min((closest_match_idx+query_len-1) + query_len_by_2 + 2, corpus_len)
    narrowed_corpus = corpus[left: right]
    narrowed_corpus_len = len(narrowed_corpus)

    # Once we have the general region of the best match we do a more thorough search in the region
    # This is done by considering ngrams of various lengths in the region using a step of 1
    ngram_lens = [l for l in range(narrowed_corpus_len, query_len_by_2 - 1, -query_len_by_step_factor)]
    if favour_smallest:
        ngram_lens = reversed(ngram_lens)
    # Construct sets of ngrams where each set has ngrams of a particular length made over the region with a step of 1
    narrowed_corpus_ngrams = [
        [narrowed_corpus[i:i+ngram_len] for i in range(0, narrowed_corpus_len-ngram_len+1)] 
        for ngram_len in ngram_lens
    ]

    # Thorough search of the region in which the best match exists
    for ngram_set in narrowed_corpus_ngrams:
        for ngram in ngram_set:
            ngram_dist = ld(ngram, query)
            if ngram_dist < min_dist:
                min_dist = ngram_dist
                closest_match = ngram

    return {
        'best_match': closest_match,
        'min_ld': min_dist
    }

query = "ipsum dolor"
corpus = "lorem 1psum dlr sit amet"
match = get_best_match(query, corpus)
print("Best match")
print(match['best_match'])
print(f"Minimum Levenshtein distance: {match['min_ld']}")

Output

Best match
1psum dlr
Minimum Levenshtein distance: 3
Malanie answered 31/8, 2023 at 22:25 Comment(0)