Top K Frequent Words using heaps in Python [duplicate]

Asked 11/11, 2020 at 0:0 Answered 11/11, 2020 at 2:6

I'm trying to solve the Top K Frequent Words Leetcode problem in O(N log K) time and am getting an undesirable result. My Python3 code and console output are below:

from collections import Counter
import heapq

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:
        
        counts = Counter(words)
        print('Word counts:', counts)
        
        result = []
        for word in counts:
            print('Word being added:', word)
            if len(result) < k:
                heapq.heappush(result, (-counts[word], word))
                print(result)
            else:
                heapq.heappushpop(result, (-counts[word], word))
        result = [r[1] for r in result]
        
        return result

----------- Console output -----------

Word counts: Counter({'the': 3, 'is': 3, 'sunny': 2, 'day': 1})
Word being added: the
[(-3, 'the')]
Word being added: day
[(-3, 'the'), (-1, 'day')]
Word being added: is
[(-3, 'is'), (-1, 'day'), (-3, 'the')]
Word being added: sunny
[(-3, 'is'), (-2, 'sunny'), (-3, 'the'), (-1, 'day')]

When I run the test case ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"] with K = 4, I find that the word the gets shifted to the end of the list (after day) once is is added even though they both have a count of 3. This makes sense since the parent need only be <= the children and the children are not ordered in any way. Since (-2, 'sunny') and (-3, 'the') are both > (-3, 'is'), the heap invariant is, in fact, maintained even though (-3, 'the') < (-2, 'sunny') and is the right child of (-3, 'is'). The expected result is ["is","the","sunny","day"] while the output of my code is ["is","sunny","the","day"].

Should I be using heaps to solve this problem in O(N log K) time, and if so, how can I modify my code to achieve the desired result?

Harri answered 11/11, 2020 at 0:0 Comment(0)

You're on the right track with using heapq and Counter you just need to make a slight modification in how you are using them in relation to k: (you need to iterate the whole of counts before adding anything to result):

from collections import Counter
import heapq

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:
        counts = collections.Counter(words)
        max_heap = []
        for key, val in counts.items():
            heapq.heappush(max_heap, (-val, key))
        
        result = []
        while k > 0:
            result.append(heapq.heappop(max_heap)[1])
            k -= 1
        
        return result

Did not read the requirement of being O(N log k) before, here is a modification to the above solution to achieve that:

from collections import Counter, deque
import heapq

class WordWithFrequency(object):
    def __init__(self, word, frequency):
        self.word = word
        self.frequency = frequency

    def __lt__(self, other):
        if self.frequency == other.frequency:
            return lt(other.word, self.word)
        else:
            return lt(self.frequency, other.frequency)

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:    
        counts = collections.Counter(words)
        
        max_heap = []
        for key, val in counts.items():
            heapq.heappush(max_heap, WordWithFrequency(key, val))
            if len(max_heap) > k:
                heapq.heappop(max_heap)
        
        result = deque([]) # can also use a list and just reverse at the end
        while k > 0:
            result.appendleft(heapq.heappop(max_heap).word)
            k -= 1
        
        return list(result)

Corinthian answered 11/11, 2020 at 0:12 Comment(2)

Thank you for your answer. Assuming K < N and all elements of words are unique, if you push all of them onto the heap, won't this be O(N log N) rather than O(N log K)? – Harri 11/11, 2020 at 0:34

The max_heap will have at any point only K + 1 elements. It'll perform the heappop with respect to K + 1 – Chiarra 3/2, 2021 at 7:40

You don't need to bother with a heap. Counter() already has a method to return the most common elements.

>>> c = Counter(["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"])
>>> c.most_common(4)
[('the', 3), ('is', 3), ('sunny', 2), ('day', 1)]

Beezer answered 11/11, 2020 at 0:16 Comment(4)

+1, but for context, the platform OP is using, Leetcode is for interview preparation, and generally interviewers don't allow you to library functions that do the core aspect of the question. E.g. If a problem has nothing to do with binary search I bet most interviewers would allow you to use bisect.bisect_left, but most_common would be stretching it for this question, imo, unless its for a Python specialist role. – Corinthian 11/11, 2020 at 0:23

@ShashSinha I didn't realize the context. On the other hand, back when I did interviews, I asked a specific question that always got long complicated answers. One candidate just replied "I'd put them in a heap." and that's all I needed to hear! – Beezer 11/11, 2020 at 0:46

Does it have the desired time complexity? – Vasos 18/11, 2020 at 16:37

@superbrain. Counter uses a heap to find the n largest. So I suspect it uses the standard N log n algorithm, where N is the number of items you have, and n is the "I want the n largest". – Beezer 18/11, 2020 at 17:21

-3

Using Counter() and sort(), this'd also pass on O(N Log N):

class Solution:
    def topKFrequent(self, words, k):
        words_countmap = collections.Counter(words)
        items = list(words_countmap.items())
        items.sort(key=lambda item: (-item[1], item[0]))
        return [item[0] for item in items[0:k]]

Here is a Java Solution using PriorityQueue:

class Solution {
    public static final List<String> topKFrequent(
        final String[] words,
        final int k
    ) {
        LinkedList<String> frequentWords = new LinkedList<>();
        HashMap<String, Integer> wordsMap = new HashMap<>();

        for (int i = 0; i < words.length; i++) {
            wordsMap.put(words[i], wordsMap.getOrDefault(words[i], 0) + 1);
        }

        PriorityQueue<Map.Entry<String, Integer>> wordCounterQueue = new PriorityQueue<>(
            (a, b) -> a.getValue() == b.getValue() ? b.getKey().compareTo(a.getKey()) : a.getValue() -
            b.getValue()
        );

        for (Map.Entry<String, Integer> key : wordsMap.entrySet()) {
            wordCounterQueue.offer(key);

            if (wordCounterQueue.size() > k) {
                wordCounterQueue.poll();
            }
        }

        while (!wordCounterQueue.isEmpty()) {
            frequentWords.add(0, wordCounterQueue.poll().getKey());
        }

        return frequentWords;
    }
}

Similarly with C++:

// Most of headers are already included;
// Can be removed;
#include <iostream>
#include <cstdint>
#include <vector>
#include <string>
#include <unordered_map>
#include <queue>
#include <utility>

// The following block might slightly improve the execution time;
// Can be removed;
static const auto __optimize__ = []() {
    std::ios::sync_with_stdio(false);
    std::cin.tie(nullptr);
    std::cout.tie(nullptr);
    return 0;
}();




struct Solution {
    using ValueType = std::uint_fast16_t;
    using Pair = std::pair<std::string, ValueType>;
    static const std::vector<std::string> topKFrequent(
        const std::vector<std::string>& words,
        const ValueType k
    ) {
        std::unordered_map<std::string, ValueType> word_counts;

        for (const auto& word : words) {
            ++word_counts[word];
        }

        std::priority_queue<Pair, std::vector<Pair>, Comparator> word_freqs;

        for (const auto& word_count : word_counts) {
            word_freqs.push(word_count);


            if (std::size(word_freqs) > k) {
                word_freqs.pop();
            }
        }

        std::vector<std::string> k_frequent;

        while (!word_freqs.empty()) {
            k_frequent.emplace_back(word_freqs.top().first);
            word_freqs.pop();
        }

        std::reverse(std::begin(k_frequent), std::end(k_frequent));

        return k_frequent;
    }

  private:
    struct Comparator {
        Comparator() {}
        ~Comparator() {}
        const bool operator()(
            const Pair& a,
            const Pair& b
        ) {
            return a.second > b.second || (a.second == b.second && a.first < b.first);
        }
    };
};

int main() {
    std::vector<std::string> words = {"i", "love", "leetcode", "i", "love", "coding"};
    std::vector<std::string> top_k_frequents = Solution().topKFrequent(words, 2);

    for (const auto& word : top_k_frequents) {
        std::cout << word << "\n";
    }
}

Illegalize answered 11/11, 2020 at 2:6 Comment(0)

Recommended topics

Hot tags