I'm trying to solve the Top K Frequent Words Leetcode problem in O(N log K) time and am getting an undesirable result. My Python3 code and console output are below:
from collections import Counter
import heapq
class Solution:
def topKFrequent(self, words: List[str], k: int) -> List[str]:
counts = Counter(words)
print('Word counts:', counts)
result = []
for word in counts:
print('Word being added:', word)
if len(result) < k:
heapq.heappush(result, (-counts[word], word))
print(result)
else:
heapq.heappushpop(result, (-counts[word], word))
result = [r[1] for r in result]
return result
----------- Console output -----------
Word counts: Counter({'the': 3, 'is': 3, 'sunny': 2, 'day': 1})
Word being added: the
[(-3, 'the')]
Word being added: day
[(-3, 'the'), (-1, 'day')]
Word being added: is
[(-3, 'is'), (-1, 'day'), (-3, 'the')]
Word being added: sunny
[(-3, 'is'), (-2, 'sunny'), (-3, 'the'), (-1, 'day')]
When I run the test case ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"]
with K = 4
, I find that the word the
gets shifted to the end of the list (after day
) once is
is added even though they both have a count of 3. This makes sense since the parent need only be <= the children and the children are not ordered in any way. Since (-2, 'sunny')
and (-3, 'the')
are both > (-3, 'is')
, the heap invariant is, in fact, maintained even though (-3, 'the')
< (-2, 'sunny')
and is the right child of (-3, 'is')
. The expected result is ["is","the","sunny","day"]
while the output of my code is ["is","sunny","the","day"]
.
Should I be using heaps to solve this problem in O(N log K) time, and if so, how can I modify my code to achieve the desired result?
words
are unique, if you push all of them onto the heap, won't this be O(N log N) rather than O(N log K)? – Harri