Check if two strings contain the same set of words in Python
I am trying to compare two sentences and see if they contain the same set of words.
E.g. comparing "today is a good day" and "is today a good day" should return True.
I am using the Counter class from the collections module right now:

from collections import Counter


vocab = {}
for line in file_ob:
    flag = 0
    for sentence in vocab:
        if Counter(sentence.split(" ")) == Counter(line.split(" ")):
            vocab[sentence] += 1
            flag = 1
            break
    if flag == 0:
        vocab[line] = 1

It seems to work fine for a few lines, but my text file has more than 1000 lines and it never finishes executing. Is there a more efficient way to compute the result for the entire file?

EDIT:

I just need a replacement for Counter, not a change to the overall implementation.

Willner answered 25/6, 2019 at 20:22 Comment(4)
Do you need to distinguish duplicate words? Should "to to" match "to to to"?Spinifex
If not, turn the list of words into a set and test if the two sets are equal.Spinifex
Anything else I can use instead of sets?Willner
Is there any reason why you don't want to change the implementation?Mongolic

You really don't need to use two loops.

Correct way to use dicts

Let's say you have a dict:

my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 5, 'g': 6}

Your code is basically equivalent to:

for (key, value) in my_dict.items():
    if key == 'c':
        print(value)
        break
#=> 3

But the whole point of dict (and set, Counter, ...) is to be able to get the desired value directly:

my_dict['c']
#=> 3

If your dict has 1000 values, the first example will be 500 times slower than the second, on average. Here's a simple description I've found on Reddit:

A dict is like a magic coat check room. You hand your coat over and get a ticket. Whenever you give that ticket back, you immediately get your coat. You can have a lot of coats, but you still get your coat back immediately. There is a lot of magic going on inside the coat check room, but you don't really care as long as you get your coat back immediately.
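As a rough illustration of the coat-check idea (absolute timings will vary by machine), a direct lookup stays fast regardless of the dict's size, while the loop scales with the number of keys:

```python
import timeit

my_dict = {i: i * 2 for i in range(1000)}

def linear_scan():
    # loop over every key until we find the one we want
    for key, value in my_dict.items():
        if key == 999:
            return value

def direct_lookup():
    # hash-based lookup, no iteration
    return my_dict[999]

print(timeit.timeit(linear_scan, number=10_000))
print(timeit.timeit(direct_lookup, number=10_000))
```

The two functions return the same value; only the time taken differs.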

Refactored code

You just need to find a common signature between "Today is a good day!" and "Is today a good day?". One way would be to extract the words, convert them to lowercase, sort them and join them. What's important is that the output should be immutable (e.g. tuple, string, frozenset). This way, it can be used inside sets, Counters or dicts directly, without needing to iterate over every key.

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

vocab = Counter()
for sentence in sentences:
    sorted_words = ' '.join(sorted(sentence.lower().split(" ")))
    vocab[sorted_words] += 1

vocab
#=> Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

or even shorter:

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = Counter(sorted_words(sentence) for sentence in sentences)
# Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

This code should be much faster than what you've tried until now.

Yet another alternative

If you want to keep the original sentences in a list, you can use setdefault:

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = {}
for sentence in sentences:
    vocab.setdefault(sorted_words(sentence), []).append(sentence)

vocab

#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
# 'a b c': ['a b c', 'c b a'],
# 'a a b c': ['a a b c']}
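If you prefer it to setdefault, a collections.defaultdict gives the same grouping (a sketch of the equivalent code):

```python
from collections import defaultdict

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = defaultdict(list)
for sentence in sentences:
    # missing keys are created with an empty list automatically
    vocab[sorted_words(sentence)].append(sentence)

dict(vocab)
#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
#    'a b c': ['a b c', 'c b a'],
#    'a a b c': ['a a b c']}
```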
Mongolic answered 25/6, 2019 at 21:1 Comment(9)
This actually works really fast. But could you elaborate on how I could make the above code faster just by changing the Counter and using something else, either user-defined or built in?Willner
I lose the order of words when I create a dictionary with the strings as keys. Yes I am able to get the count of similar sentences but then I lose the original orderWillner
@TheLastCoder: That's why I wrote the "more complex example". Anyway, there's a shorter version in "Yet another alternative".Mongolic
I understand how the dictionary works. What I want is to have dictionary keys that are already in the text with the count equal to the number of similar strings(similar means have the same set of words)Willner
@TheLastCoder: What would a key look like, for example for "Today is a good day"?Mongolic
yes! and its value will be 2, since ['Today is a good day', 'Is today a good day']Willner
@TheLastCoder: Yes what? What would be the common key for ['Today is a good day', 'Is today a good day']?Mongolic
Anything, maybe the one that comes firstWillner
@TheLastCoder: All the information you need is inside vocab = {'a day good is today': ['Today is a good day', 'Is today a good day'], 'a b c': ['a b c', 'c b a'], 'a a b c': ['a a b c']}. If you have a sentence, you just need to call vocab[sorted_words(sentence)] to get all the sentences with the same words, or vocab[sorted_words(sentence)][0] to get the first one.Mongolic

Try something like

set(sentence.split(" ")) == set(line.split(" "))

Comparing set objects is faster than comparing Counter objects. Both are hash-based containers, but when you compare Counter objects, both the keys and the values have to be compared, whereas comparing sets only compares keys. Note that a set also discards duplicate words, so sentences with repeated words can compare equal.
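A quick sketch of that semantic difference, for sentences that differ only in word repetitions:

```python
from collections import Counter

a = "to be or not to be"
b = "to be or not"

# set() discards duplicates, so these compare equal:
print(set(a.split()) == set(b.split()))          # True

# Counter keeps multiplicities, so these do not:
print(Counter(a.split()) == Counter(b.split()))  # False
```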
Thank you Eric and Barmar for your inputs.

Your full code will look like

vocab = {...}  # a dict with ~1000 sentences as keys
for line in file_ob:
    for sentence in vocab:
        if set(sentence.split(" ")) == set(line.split(" ")):
            vocab[sentence] += 1

Neptunium answered 25/6, 2019 at 20:27 Comment(10)
There's really not much difference between a set, a dict and a counter. A set is basically a dict in which values are ignored. It's much better to use a O(1) or O(n) solution with counters than O(n**2) with sets.Mongolic
I am sorry, I phrased the question for simplicity. In my actual code, the vocab is generated within the for loop. Basically I am generating ngrams from a text file and ensuring that no two ngram has the same set of words. Converting them to set actually worked but it's still slow. I was wondering if there was a faster optionWillner
@EricDuminil Is there anything that I can use instead of sets?Willner
@EricDuminil The counter solution has to compare both the keys and values, the set only has to compare keys. They're both O(n).Spinifex
@TheLastCoder: sets and counters are perfectly fine. You just need to find the correct keys and use sets the way they're supposed to be used : not iterating over every key.Mongolic
@Barmar: The problem here is the structure of the code and that there's one unneeded loop. It doesn't make any noticeable difference at all if both keys and values are compared, or just the keys. OP's code seems to be at least O(n**2), but could be O(n).Mongolic
@Eric I told you, this is not the actual code I am running, I am generating the vocab dictionary within the loop, so I have to put it inside. I am really sorry for the confusion. Wait, let me edit the question. By far set has been the fastest implementationWillner
@Eric Please check the code again. Hope it makes sense. Do ask me if you need any clarification.Willner
Edited the solution. Thank you for all your inputs.Neptunium
Sorry for being blunt, but Counter wasn't the problem, and your code isn't the solution either. There simply shouldn't be two for loops.Mongolic

To take duplicate/multiple words into account, your equality comparison could be:

def hash_sentence(s):
    # join with a separator to avoid collisions such as "ab c" vs "a bc"
    return hash(' '.join(sorted(s.split())))

a = 'today is a good day'
b = 'is today a good day'
c = 'today is a good day is a good day'

hash_sentence(a) == hash_sentence(b)  # True
hash_sentence(a) == hash_sentence(c)  # False

Also, note that in your implementation every line is compared against every sentence in vocab (for sentence in vocab:).

Yunyunfei answered 25/6, 2019 at 20:28 Comment(15)
I phrased the question for simplicity. In my actual code, the vocab is generated within the for loop. Basically I am generating ngrams from a text file and ensuring that no two ngram has the same set of words. Converting them to set actually worked but it's still slow. I was wondering if there was a faster optionWillner
It's probably the way to go. You can then group the sentences by hash and get similar sentences directly.Mongolic
This is an elegant solution. Let me use the timeit function and get back to you! let me see if the set function or this implementation is fasterWillner
Time for this implementation : 31.8790001869, now running the plain old set implementationWillner
time taken using sets : 15.1610000134Willner
that is interesting... try to tuple the sorted sequence instead of joining to a string if you would...Yunyunfei
actually, depending on your input data, you may be able to even omit the .split() and just sort the string directly -- for the purpose of hashing that is.Yunyunfei
Its just plain text file with over 1000 lines of random text. Yeah, I can try thatWillner
Well it certainly performed faster than using join. time taken 26.7250001431 still slower than sets :/Willner
Out of despair a last one: sum(hash(x) for x in s.split()), else: built-ins for the win!!!Yunyunfei
@TheLastCoder: Could you please check https://mcmap.net/q/1462203/-check-if-two-strings-contain-the-same-set-of-words-in-python and see if it works for you ? Load all your file with sentences = file_ob.readlines() and run the code. I think it should complete in a few ms.Mongolic
@MarkusRother: Don't despair yet. Your code and idea are fine. I really think it's the way OP uses them that's the problem here.Mongolic
@MarkusRother: time taken 41.7990000248 :/Willner
@Willner wait, let me try it. Sorry for not responding.Willner
I am just going with sets. @MarkusRother, I have saved your code for referencing in the future.Willner

In your code, you can hoist the Counter construction out of the inner loop instead of recomputing it for every pair; this should speed up the algorithm by a factor proportional to the average number of tokens per string.

from collections import Counter
vocab = {...}  # a dict with ~1000 sentences as keys

# build each sentence's Counter once, up front
vocab_counter = {k: Counter(k.split(" ")) for k in vocab}

for line in file_ob:
    line_counter = Counter(line.split(" "))
    for sentence in vocab:
        if vocab_counter[sentence] == line_counter:
            vocab[sentence] += 1

Further improvements could be had by using the Counters as keys into a dictionary, which would let you replace the linear search for matching sentences with a lookup. Since a plain dict (and thus a Counter) is not hashable, something like the frozendict package would probably be useful there.
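One way to sketch that lookup-based idea without an extra package is to use frozenset(counter.items()) as a hashable stand-in for the Counter (an illustration, not the only option; the word counts become part of the key):

```python
from collections import Counter

sentences = ["today is a good day", "is today a good day", "a b c"]

counts = Counter()
for line in sentences:
    # frozenset of (word, count) pairs is hashable and order-independent
    key = frozenset(Counter(line.split(" ")).items())
    counts[key] += 1

# the two reordered sentences share one key
print(max(counts.values()))  # 2
```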

Biblicist answered 25/6, 2019 at 20:34 Comment(1)
I am sorry, I phrased the question for simplicity. In my actual code, the vocab is generated within the for loop. Basically I am generating ngrams from a text file and ensuring that no two ngram has the same set of words. Converting them to set actually worked but it's still slow. I was wondering if there was a faster optionWillner
