How to find a similar substring inside a large string with a similarity score in python?

Asked 5/1, 2018 at 16:24 Answered 5/8, 2023 at 3:10

Solved python string nlp distance similarity

What I'm looking for is not just a plain similarity score between two texts. But a similarity score of a substring inside a string. Say:

text1 = 'cat is sleeping on the mat'.

text2 = 'The cat is sleeping on the red mat in the living room'.

In the above example, all the words of text1 are present in the text2 completely, hence the similarity should be 100%.

If some words of text1 are missing, the score shall be less.

I'm working with a large dataset of varying paragraph size, hence finding a smaller paragraph inside a bigger one with such similarity score is crucial.

I found only string similarities such as cosine similarities, difflib similarity etc. which compares two strings. But not about a score of substring inside another string.

Saturation answered 5/1, 2018 at 16:24 Comment(5)

This question is about algorithms rather than python. – Necklace 5/1, 2018 at 16:30

This might be a duplicate to #17388713 But I would check out that library for comparing two strings. Then you will need to find a way to find the sub string. It might work with sub strings. Havent tested to try that out though. – Boggess 5/1, 2018 at 16:49

Have you tried Levenshtein distance? It may be useful here, included it in my answer as well. – Phoney 5/1, 2018 at 18:33

See also: #15173725 – Carchemish 6/1, 2018 at 16:54

Dupe of #17388713 too? – Carchemish 6/1, 2018 at 16:55

Based on your description, how about:

>>> a = "cat is sleeping on the mat"
>>> b = "the cat is sleeping on the red mat in the living room"
>>> a = a.split(" ")
>>> score = 0.0
>>> for word in a: #for every word in your string
        if word in b: #if it is in your bigger string increase score
            score += 1
>>> score/len(a) #obtain percentage given total word number
1.0

In case it had a missing word for example:

>>> c = "the cat is not sleeping on the mat"
>>> c = c.split(" ")
>>> score = 0.0
>>> for w in c:
        if w in b:
            score +=1
>>> score/len(c)
0.875

Additionally, you can do as @roadrunner suggest and split b and save it as a set to speed up your performance with b = set(b.split(" ")). This will reduce that part complexity to O(1) and improve the overall algorithm to a O(n) complexity.

Edit: You say you already tried some metrics like Cosine Similarity etc. However I suspect you may benefit from checking the Levenshtein Distance similarity, which I suspect could be of some use in this case as addition to the solutions provided.

Phoney answered 5/1, 2018 at 16:49 Comment(5)

Nice answer, definitely the most simplest approach. I think you need to split b here though? – Errol 5/1, 2018 at 18:16

@Errol thanks. This works without having to split b (pasted directly from my console). However, you can split it if you want the performance enhancement, which would be better I guess. – Phoney 5/1, 2018 at 18:19

Yeah you're right. This gives the OP a good starting point, he/she can choose what to use and what not to use, depending on the text he/she is working with. – Errol 5/1, 2018 at 18:25

Thank you very much DarkCygnus and @RoadRunner. This and the above one, are a perfect fit for what I was looking for. I tried Levenshtein distance in my first attempt, but since my Text2 has unwanted text around it, I would get a lower similarity score when compared with Text1. Hence was looking for an alternative. – Saturation 7/1, 2018 at 21:27

Clad I could help @SanjayKamath :) consider accepting an answer if it solved your problem once you've tried. Cheers – Phoney 8/1, 2018 at 0:58

You could also use a collections.defaultdict to store the counts of words in word_a that exist in word_b, then sum() the counts divided by the length of word_a at the end:

from collections import defaultdict

a = "the cat is not sleeping on the mat"
b = "the cat is sleeping on the red mat in the living room"

word_a = a.split()
word_b = set(b.split())

d = defaultdict(int)
for word in word_a:
    if word in word_b:
        d[word] += 1

print(sum(d.values()) / len(word_a))

Which Outputs:

0.875

Note: Since we only care about checking if words in word_a exist in word_b, then converting word_b to a set() will allow O(1) lookup, instead of keeping it a list, which will be O(n). This then makes the overall time complexity of the above code O(n).

Errol answered 5/1, 2018 at 17:52 Comment(3)

Interesting addition. So you say word in word_b is going to be O(1) if it is a set then? Still, you are doing for word in word_a, so that makes your algorithm O(n) right? – Phoney 5/1, 2018 at 18:0

@Phoney Yeah I agree, this solution is O(n). I only added the set lookup optimization to slightly improve performance. I might have to run some tests on some bigger samples to see if it really makes much of a difference. – Errol 5/1, 2018 at 18:4

Indeed. OP can also tailor these options to fit his needs depending on the data he is handling. Cheers – Phoney 5/1, 2018 at 18:6

Similar to DarkCygbus but the similarity is based on its count total character instead of words. On the other hand, this script has only checked coincidence with the complete words (text_2.split())

from __future__ import division

text_1 = 'cat is sleeping on the mat'
text_2 = 'The cat is sleeping on the red mat in the living room'
no_match = 0
match = 0

for word in text_1.split():
    if word not in text_2.split():
        no_match += len(word)
    else:
        match += len(word)

similarity = match/(match + no_match)
print ('{0:.0%}'.format(similarity))

Patin answered 5/1, 2018 at 17:11 Comment(7)

Although this seems like an alternative approach, OP seems to clearly ask about some way of considering similarities between words, as he mentions it several times in the Question, i.e.: "But not about a score of substring inside another string.", "In the above example, the words...is present in the Text2 completely, hence the similarity should be 100%." – Phoney 5/1, 2018 at 17:26

Besides, I suspect that your approach may give some unexpected behavior. If you compare by characters instead of by words, you may get 100% match in things like "asdf" and "as you did first", as it contains all characters, even though they are completely different things. – Phoney 5/1, 2018 at 17:28

I don't think so, if you check in your script 'a' like a = "cat is sleeping on the mat lee ivi" .. its similarity will be 100%, in my script 81%. – Stoffel 5/1, 2018 at 17:53

Words like 'ivi' and 'lee' don't make sense and probably are not present on typed text analysis. Still, my last comment is true, but I agree that my approach may have different results if unrealistic words are present. Should be up to OP to decide what fits his needs better. – Phoney 5/1, 2018 at 18:9

With the word 'ping' happen the same and it isn't unrealistic word ... I think that you only change word_b.split() instead of word_b (in if) and it will be right. – Stoffel 5/1, 2018 at 18:15

You said "you may get 100% match in things like "asdf" and "as you did first", as it contains all characters, even though they are completely different things" .... and that is wrong, If you look my code, I check complete word but the similiraty value uses size the word. – Stoffel 5/1, 2018 at 18:19

I said I "suspect" that may happen. Your approach is also valid like I said, but its up to OP to see what fits his needs better. I will leave the subject here, as comments are not for extended discussion. – Phoney 5/1, 2018 at 18:21

I think this can be achieved with Levenshtein distancing combined with substring matching. What you can do is split a sentence into smaller words (using spaces as a separator), then run the Levenshtein matching algorithm to match individual words with your search string. Something like:

def similar_word(string, substring):
    threshold=2

    def levenshtein_distance(s1, s2):
        m, n = len(s1), len(s2)
        dp = [[0] * (n + 1) for _ in range(m + 1)]

        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0: dp[i][j] = j
                elif j == 0: dp[i][j] = i
                elif s1[i - 1] == s2[j - 1]: dp[i][j] = dp[i - 1][j - 1]
                else: dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
        return dp[m][n]

    for i in range(len(string) - len(substring) + 1):
        distance = levenshtein_distance(string[i:i + len(substring)], substring)
        if distance <= threshold: return True
    
    return False

https://gist.github.com/4f77616973/66a784c4c5921359299d603419a8f01b

Since you want the score, you can tweak the above code to return the distance instead of True/False.

Hope that helps! :)

Joist answered 5/8, 2023 at 3:10 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags