Check if two strings contain the same set of words in Python
I am trying to compare two sentences and see if they contain the same set of words.
E.g. comparing "today is a good day" and "is today a good day" should return True.
I am using the Counter class from the collections module right now:

from collections import Counter


vocab = {}
for line in file_ob:
    flag = 0
    for sentence in vocab:
        if Counter(sentence.split(" ")) == Counter(line.split(" ")):
            vocab[sentence] += 1
            flag = 1
            break
    if flag == 0:
        vocab[line] = 1

It seems to work fine for a few lines, but my text file has more than 1000 lines and it never finishes executing. Is there a more efficient way to compute the result for the entire file?

EDIT:

I just need a replacement for Counter, not a change to the overall implementation.

Willner answered 25/6, 2019 at 20:22 Comment(4)
Do you need to distinguish duplicate words? Should "to to" match "to to to"?Spinifex
If not, turn the list of words into a set and test if the two sets are equal.Spinifex
Anything else I can use instead of sets?Willner
Is there any reason why you don't want to change the implementation?Mongolic

You really don't need to use two loops.

Correct way to use dicts

Let's say you have a dict:

my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 5, 'g': 6}

Your code is basically equivalent to:

for (key, value) in my_dict.items():
    if key == 'c':
        print(value)
        break
#=> 3

But the whole point of dict (and set, Counter, ...) is to be able to get the desired value directly:

my_dict['c']
#=> 3

If your dict has 1000 values, the first example will be 500 times slower than the second, on average. Here's a simple description I've found on Reddit:

A dict is like a magic coat check room. You hand your coat over and get a ticket. Whenever you give that ticket back, you immediately get your coat. You can have a lot of coats, but you still get your coat back immediately. There is a lot of magic going on inside the coat check room, but you don't really care as long as you get your coat back immediately.
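As a rough illustration of the coat-check idea (absolute timings will vary by machine), a direct lookup stays fast regardless of the dict's size, while the loop scales with the number of keys:

```python
import timeit

my_dict = {i: i * 2 for i in range(1000)}

def linear_scan():
    # loop over every key until we find the one we want
    for key, value in my_dict.items():
        if key == 999:
            return value

def direct_lookup():
    # hash-based lookup, no iteration
    return my_dict[999]

print(timeit.timeit(linear_scan, number=10_000))
print(timeit.timeit(direct_lookup, number=10_000))
```

The two functions return the same value; only the time taken differs.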

Refactored code

You just need to find a common signature between "Today is a good day!" and "Is today a good day?". One way would be to extract the words, convert them to lowercase, sort them and join them. What's important is that the output should be immutable (e.g. tuple, string, frozenset). This way, it can be used inside sets, Counters or dicts directly, without needing to iterate over every key.

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

vocab = Counter()
for sentence in sentences:
    sorted_words = ' '.join(sorted(sentence.lower().split(" ")))
    vocab[sorted_words] += 1

vocab
#=> Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

or even shorter:

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = Counter(sorted_words(sentence) for sentence in sentences)
# Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

This code should be much faster than what you've tried until now.

Yet another alternative

If you want to keep the original sentences in a list, you can use setdefault:

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = {}
for sentence in sentences:
    vocab.setdefault(sorted_words(sentence), []).append(sentence)

vocab

#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
# 'a b c': ['a b c', 'c b a'],
# 'a a b c': ['a a b c']}
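If you prefer it to setdefault, a collections.defaultdict gives the same grouping (a sketch of the equivalent code):

```python
from collections import defaultdict

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = defaultdict(list)
for sentence in sentences:
    # missing keys are created with an empty list automatically
    vocab[sorted_words(sentence)].append(sentence)

dict(vocab)
#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
#    'a b c': ['a b c', 'c b a'],
#    'a a b c': ['a a b c']}
```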
Mongolic answered 25/6, 2019 at 21:1 Comment(9)
This actually works really fast. But could you elaborate on how I could make the above code faster just by changing the Counter and using something else, either user-defined or built in?Willner
I lose the order of words when I create a dictionary with the strings as keys. Yes I am able to get the count of similar sentences but then I lose the original orderWillner
@TheLastCoder: That's why I wrote the "more complex example". Anyway, there's a shorter version in "Yet another alternative".Mongolic
I understand how the dictionary works. What I want is to have dictionary keys that are already in the text with the count equal to the number of similar strings(similar means have the same set of words)Willner
@TheLastCoder: What would a key look like, for example for "Today is a good day"?Mongolic
yes! and its value will be 2, since ['Today is a good day', 'Is today a good day']Willner
@TheLastCoder: Yes what? What would be the common key for ['Today is a good day', 'Is today a good day']?Mongolic
Anything, maybe the one that comes firstWillner
@TheLastCoder: All the information you need is inside vocab = {'a day good is today': ['Today is a good day', 'Is today a good day'], 'a b c': ['a b c', 'c b a'], 'a a b c': ['a a b c']}. If you have a sentence, you just need to call vocab[sorted_words(sentence)] to get all the sentences with the same words, or vocab[sorted_words(sentence)][0] to get the first one.Mongolic

Try something like

set(sentence.split(" ")) == set(line.split(" "))

Comparing set objects is faster than comparing Counter objects. Both are hash-based containers, but when you compare Counter objects, both the keys and the values have to be compared, whereas comparing sets only compares keys. Note that a set also discards duplicate words, so sentences with repeated words can compare equal.
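A quick sketch of that semantic difference, for sentences that differ only in word repetitions:

```python
from collections import Counter

a = "to be or not to be"
b = "to be or not"

# set() discards duplicates, so these compare equal:
print(set(a.split()) == set(b.split()))          # True

# Counter keeps multiplicities, so these do not:
print(Counter(a.split()) == Counter(b.split()))  # False
```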
Thank you Eric and Barmar for your inputs.

Your full code will look like

vocab = {...}  # a dict with ~1000 sentences as keys
for line in file_ob:
    for sentence in vocab:
        if set(sentence.split(" ")) == set(line.split(" ")):
            vocab[sentence] += 1

Neptunium answered 25/6, 2019 at 20:27 Comment(10)
There's really not much difference between a set, a dict and a counter. A set is basically a dict in which values are ignored. It's much better to use a O(1) or O(n) solution with counters than O(n**2) with sets.Mongolic
I am sorry, I phrased the question for simplicity. In my actual code, the vocab is generated within the for loop. Basically I am generating ngrams from a text file and ensuring that no two ngram has the same set of words. Converting them to set actually worked but it's still slow. I was wondering if there was a faster optionWillner
@EricDuminil Is there anything that I can use instead of sets?Willner
@EricDuminil The counter solution has to compare both the keys and values, the set only has to compare keys. They're both O(n).Spinifex
@TheLastCoder: sets and counters are perfectly fine. You just need to find the correct keys and use sets the way they're supposed to be used : not iterating over every key.Mongolic
@Barmar: The problem here is the structure of the code and that there's one unneeded loop. It doesn't make any noticeable difference at all if both keys and values are compared, or just the keys. OP's code seems to be at least O(n**2), but could be O(n).Mongolic
@Eric I told you, this is not the actual code I am running, I am generating the vocab dictionary within the loop, so I have to put it inside. I am really sorry for the confusion. Wait, let me edit the question. By far set has been the fastest implementationWillner
@Eric Please check the code again. Hope it makes sense. Do ask me if you need any clarification.Willner
Edited the solution. Thank you for all your inputs.Neptunium
Sorry for being blunt, but Counter wasn't the problem, and your code isn't the solution either. There simply shouldn't be two for loops.Mongolic

To take duplicate/multiple words into account, your equality comparison could be:

def hash_sentence(s):
    # join with a separator to avoid collisions such as "ab c" vs "a bc"
    return hash(' '.join(sorted(s.split())))

a = 'today is a good day'
b = 'is today a good day'
c = 'today is a good day is a good day'

hash_sentence(a) == hash_sentence(b)  # True
hash_sentence(a) == hash_sentence(c)  # False

Also, note that in your implementation every line is compared against every sentence in vocab (for sentence in vocab:).

Yunyunfei answered 25/6, 2019 at 20:28 Comment(15)
I phrased the question for simplicity. In my actual code, the vocab is generated within the for loop. Basically I am generating ngrams from a text file and ensuring that no two ngram has the same set of words. Converting them to set actually worked but it's still slow. I was wondering if there was a faster optionWillner
It's probably the way to go. You can then group the sentences by hash and get similar sentences directly.Mongolic
This is an elegant solution. Let me use the timeit function and get back to you! let me see if the set function or this implementation is fasterWillner
Time for this implementation : 31.8790001869, now running the plain old set implementationWillner
time taken using sets : 15.1610000134Willner
that is interesting... try to tuple the sorted sequence instead of joining to a string if you would...Yunyunfei
actually, depending on your input data, you may be able to even omit the .split() and just sort the string directly -- for the purpose of hashing that is.Yunyunfei
Its just plain text file with over 1000 lines of random text. Yeah, I can try thatWillner
Well it certainly performed faster than using join. time taken 26.7250001431 still slower than sets :/Willner
Out of despair a last one: sum(hash(x) for x in s.split()), else: built-ins for the win!!!Yunyunfei
@TheLastCoder: Could you please check https://mcmap.net/q/1462203/-check-if-two-strings-contain-the-same-set-of-words-in-python and see if it works for you ? Load all your file with sentences = file_ob.readlines() and run the code. I think it should complete in a few ms.Mongolic
@MarkusRother: Don't despair yet. Your code and idea are fine. I really think it's the way OP uses them that's the problem here.Mongolic
@MarkusRother: time taken 41.7990000248 :/Willner
@Willner wait, let me try it. Sorry for not responding.Willner
I am just going with sets. @MarkusRother, I have saved your code for referencing in the future.Willner

In your code, you can hoist the Counter construction out of the inner loop instead of recomputing it for every pair; this should speed up the algorithm by a factor proportional to the average number of tokens per string.

from collections import Counter
vocab = {...}  # a dict with ~1000 sentences as keys

# build each sentence's Counter once, up front
vocab_counter = {k: Counter(k.split(" ")) for k in vocab}

for line in file_ob:
    line_counter = Counter(line.split(" "))
    for sentence in vocab:
        if vocab_counter[sentence] == line_counter:
            vocab[sentence] += 1

Further improvements could be had by using the Counters as keys into a dictionary, which would let you replace the linear search for matching sentences with a lookup. Since a plain dict (and thus a Counter) is not hashable, something like the frozendict package would probably be useful there.
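One way to sketch that lookup-based idea without an extra package is to use frozenset(counter.items()) as a hashable stand-in for the Counter (an illustration, not the only option; the word counts become part of the key):

```python
from collections import Counter

sentences = ["today is a good day", "is today a good day", "a b c"]

counts = Counter()
for line in sentences:
    # frozenset of (word, count) pairs is hashable and order-independent
    key = frozenset(Counter(line.split(" ")).items())
    counts[key] += 1

# the two reordered sentences share one key
print(max(counts.values()))  # 2
```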

Biblicist answered 25/6, 2019 at 20:34 Comment(1)
I am sorry, I phrased the question for simplicity. In my actual code, the vocab is generated within the for loop. Basically I am generating ngrams from a text file and ensuring that no two ngram has the same set of words. Converting them to set actually worked but it's still slow. I was wondering if there was a faster optionWillner
