Top K Frequent Words using heaps in Python [duplicate]
Asked Answered
H

3

7

I'm trying to solve the Top K Frequent Words Leetcode problem in O(N log K) time and am getting an undesirable result. My Python3 code and console output are below:

from collections import Counter
import heapq

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:
        
        counts = Counter(words)
        print('Word counts:', counts)
        
        result = []
        for word in counts:
            print('Word being added:', word)
            if len(result) < k:
                heapq.heappush(result, (-counts[word], word))
                print(result)
            else:
                heapq.heappushpop(result, (-counts[word], word))
        result = [r[1] for r in result]
        
        return result

----------- Console output -----------

Word counts: Counter({'the': 3, 'is': 3, 'sunny': 2, 'day': 1})
Word being added: the
[(-3, 'the')]
Word being added: day
[(-3, 'the'), (-1, 'day')]
Word being added: is
[(-3, 'is'), (-1, 'day'), (-3, 'the')]
Word being added: sunny
[(-3, 'is'), (-2, 'sunny'), (-3, 'the'), (-1, 'day')]

When I run the test case ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"] with K = 4, I find that the word the gets shifted to the end of the list (after day) once is is added even though they both have a count of 3. This makes sense since the parent need only be <= the children and the children are not ordered in any way. Since (-2, 'sunny') and (-3, 'the') are both > (-3, 'is'), the heap invariant is, in fact, maintained even though (-3, 'the') < (-2, 'sunny') and is the right child of (-3, 'is'). The expected result is ["is","the","sunny","day"] while the output of my code is ["is","sunny","the","day"].

Should I be using heaps to solve this problem in O(N log K) time, and if so, how can I modify my code to achieve the desired result?

Harri answered 11/11, 2020 at 0:0 Comment(0)
C
7

You're on the right track with using heapq and Counter you just need to make a slight modification in how you are using them in relation to k: (you need to iterate the whole of counts before adding anything to result):

from collections import Counter
import heapq

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:
        counts = collections.Counter(words)
        max_heap = []
        for key, val in counts.items():
            heapq.heappush(max_heap, (-val, key))
        
        result = []
        while k > 0:
            result.append(heapq.heappop(max_heap)[1])
            k -= 1
        
        return result

Did not read the requirement of being O(N log k) before, here is a modification to the above solution to achieve that:

from collections import Counter, deque
import heapq

class WordWithFrequency(object):
    def __init__(self, word, frequency):
        self.word = word
        self.frequency = frequency

    def __lt__(self, other):
        if self.frequency == other.frequency:
            return lt(other.word, self.word)
        else:
            return lt(self.frequency, other.frequency)

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:    
        counts = collections.Counter(words)
        
        max_heap = []
        for key, val in counts.items():
            heapq.heappush(max_heap, WordWithFrequency(key, val))
            if len(max_heap) > k:
                heapq.heappop(max_heap)
        
        result = deque([]) # can also use a list and just reverse at the end
        while k > 0:
            result.appendleft(heapq.heappop(max_heap).word)
            k -= 1
        
        return list(result)
Corinthian answered 11/11, 2020 at 0:12 Comment(2)
Thank you for your answer. Assuming K < N and all elements of words are unique, if you push all of them onto the heap, won't this be O(N log N) rather than O(N log K)?Harri
The max_heap will have at any point only K + 1 elements. It'll perform the heappop with respect to K + 1Chiarra
B
3

You don't need to bother with a heap. Counter() already has a method to return the most common elements.

>>> c = Counter(["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"])
>>> c.most_common(4)
[('the', 3), ('is', 3), ('sunny', 2), ('day', 1)]
Beezer answered 11/11, 2020 at 0:16 Comment(4)
+1, but for context, the platform OP is using, Leetcode is for interview preparation, and generally interviewers don't allow you to library functions that do the core aspect of the question. E.g. If a problem has nothing to do with binary search I bet most interviewers would allow you to use bisect.bisect_left, but most_common would be stretching it for this question, imo, unless its for a Python specialist role.Corinthian
@ShashSinha I didn't realize the context. On the other hand, back when I did interviews, I asked a specific question that always got long complicated answers. One candidate just replied "I'd put them in a heap." and that's all I needed to hear!Beezer
Does it have the desired time complexity?Vasos
@superbrain. Counter uses a heap to find the n largest. So I suspect it uses the standard N log n algorithm, where N is the number of items you have, and n is the "I want the n largest".Beezer
I
-3

Using Counter() and sort(), this'd also pass on O(N Log N):

class Solution:
    def topKFrequent(self, words, k):
        words_countmap = collections.Counter(words)
        items = list(words_countmap.items())
        items.sort(key=lambda item: (-item[1], item[0]))
        return [item[0] for item in items[0:k]]

Here is a Java Solution using PriorityQueue:

class Solution {
    public static final List<String> topKFrequent(
        final String[] words,
        final int k
    ) {
        LinkedList<String> frequentWords = new LinkedList<>();
        HashMap<String, Integer> wordsMap = new HashMap<>();

        for (int i = 0; i < words.length; i++) {
            wordsMap.put(words[i], wordsMap.getOrDefault(words[i], 0) + 1);
        }

        PriorityQueue<Map.Entry<String, Integer>> wordCounterQueue = new PriorityQueue<>(
            (a, b) -> a.getValue() == b.getValue() ? b.getKey().compareTo(a.getKey()) : a.getValue() -
            b.getValue()
        );

        for (Map.Entry<String, Integer> key : wordsMap.entrySet()) {
            wordCounterQueue.offer(key);

            if (wordCounterQueue.size() > k) {
                wordCounterQueue.poll();
            }
        }

        while (!wordCounterQueue.isEmpty()) {
            frequentWords.add(0, wordCounterQueue.poll().getKey());
        }

        return frequentWords;
    }
}

Similarly with C++:

// Most of headers are already included;
// Can be removed;
#include <iostream>
#include <cstdint>
#include <vector>
#include <string>
#include <unordered_map>
#include <queue>
#include <utility>

// The following block might slightly improve the execution time;
// Can be removed;
static const auto __optimize__ = []() {
    std::ios::sync_with_stdio(false);
    std::cin.tie(nullptr);
    std::cout.tie(nullptr);
    return 0;
}();




struct Solution {
    using ValueType = std::uint_fast16_t;
    using Pair = std::pair<std::string, ValueType>;
    static const std::vector<std::string> topKFrequent(
        const std::vector<std::string>& words,
        const ValueType k
    ) {
        std::unordered_map<std::string, ValueType> word_counts;

        for (const auto& word : words) {
            ++word_counts[word];
        }

        std::priority_queue<Pair, std::vector<Pair>, Comparator> word_freqs;

        for (const auto& word_count : word_counts) {
            word_freqs.push(word_count);


            if (std::size(word_freqs) > k) {
                word_freqs.pop();
            }
        }

        std::vector<std::string> k_frequent;

        while (!word_freqs.empty()) {
            k_frequent.emplace_back(word_freqs.top().first);
            word_freqs.pop();
        }

        std::reverse(std::begin(k_frequent), std::end(k_frequent));

        return k_frequent;
    }

  private:
    struct Comparator {
        Comparator() {}
        ~Comparator() {}
        const bool operator()(
            const Pair& a,
            const Pair& b
        ) {
            return a.second > b.second || (a.second == b.second && a.first < b.first);
        }
    };
};

int main() {
    std::vector<std::string> words = {"i", "love", "leetcode", "i", "love", "coding"};
    std::vector<std::string> top_k_frequents = Solution().topKFrequent(words, 2);

    for (const auto& word : top_k_frequents) {
        std::cout << word << "\n";
    }
}


Illegalize answered 11/11, 2020 at 2:6 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.