This function finds best matching substring of variable length.
The implementation considers the corpus as one long string, hence avoiding your concerns with spaces and unseparated words.
Code summary:
1. Scan the corpus for match values in steps of size step
to find the approximate location of highest match value, pos
.
2. Find the substring in the vicinity of pos
with the highest match value, by adjusting the left/right positions of the substring.
from difflib import SequenceMatcher
def get_best_match(query, corpus, step=4, flex=3, case_sensitive=False, verbose=False):
"""Return best matching substring of corpus.
Parameters
----------
query : str
corpus : str
step : int
Step size of first match-value scan through corpus. Can be thought of
as a sort of "scan resolution". Should not exceed length of query.
flex : int
Max. left/right substring position adjustment value. Should not
exceed length of query / 2.
Outputs
-------
output0 : str
Best matching substring.
output1 : float
Match ratio of best matching substring. 1 is perfect match.
"""
def _match(a, b):
"""Compact alias for SequenceMatcher."""
return SequenceMatcher(None, a, b).ratio()
def scan_corpus(step):
"""Return list of match values from corpus-wide scan."""
match_values = []
m = 0
while m + qlen - step <= len(corpus):
match_values.append(_match(query, corpus[m : m-1+qlen]))
if verbose:
print(query, "-", corpus[m: m + qlen], _match(query, corpus[m: m + qlen]))
m += step
return match_values
def index_max(v):
"""Return index of max value."""
return max(range(len(v)), key=v.__getitem__)
def adjust_left_right_positions():
"""Return left/right positions for best string match."""
# bp_* is synonym for 'Best Position Left/Right' and are adjusted
# to optimize bmv_*
p_l, bp_l = [pos] * 2
p_r, bp_r = [pos + qlen] * 2
# bmv_* are declared here in case they are untouched in optimization
bmv_l = match_values[p_l // step]
bmv_r = match_values[p_l // step]
for f in range(flex):
ll = _match(query, corpus[p_l - f: p_r])
if ll > bmv_l:
bmv_l = ll
bp_l = p_l - f
lr = _match(query, corpus[p_l + f: p_r])
if lr > bmv_l:
bmv_l = lr
bp_l = p_l + f
rl = _match(query, corpus[p_l: p_r - f])
if rl > bmv_r:
bmv_r = rl
bp_r = p_r - f
rr = _match(query, corpus[p_l: p_r + f])
if rr > bmv_r:
bmv_r = rr
bp_r = p_r + f
if verbose:
print("\n" + str(f))
print("ll: -- value: %f -- snippet: %s" % (ll, corpus[p_l - f: p_r]))
print("lr: -- value: %f -- snippet: %s" % (lr, corpus[p_l + f: p_r]))
print("rl: -- value: %f -- snippet: %s" % (rl, corpus[p_l: p_r - f]))
print("rr: -- value: %f -- snippet: %s" % (rl, corpus[p_l: p_r + f]))
return bp_l, bp_r, _match(query, corpus[bp_l : bp_r])
if not case_sensitive:
query = query.lower()
corpus = corpus.lower()
qlen = len(query)
if flex >= qlen/2:
print("Warning: flex exceeds length of query / 2. Setting to default.")
flex = 3
match_values = scan_corpus(step)
pos = index_max(match_values) * step
pos_left, pos_right, match_value = adjust_left_right_positions()
return corpus[pos_left: pos_right].strip(), match_value
Example:
query = "ipsum dolor"
corpus = "lorem i psum d0l0r sit amet"
match = get_best_match(query, corpus, step=2, flex=4)
print(match)
('i psum d0l0r', 0.782608695652174)
Some good heuristic advice is to always keep step < len(query) * 3/4
, and flex < len(query) / 3
. I also added case sensitivity, in case that's important. It works quite well when you start playing with the step and flex values. Small step values gives better results but takes longer to compute. flex governs how flexible the length of the resulting substring is allowed to be.
Important to note: This will only find the first best match, so if there are multiple equally good matches, only the first will be returned. To allow for multiple matches, change index_max()
to return a list of indices for the n
highest values of the input list, and loop over adjust_left_right_positions()
for values in that list.