What is the time complexity of heapq.nlargest?
I was watching this PyCon talk (at 34:30), where the speaker says that getting the t largest elements of a list of n elements can be done in O(t + n).

How is that possible? My understanding is that creating the heap is O(n), but what is the complexity of nlargest itself? Is it O(n + t) or O(t), and what is the actual algorithm?

Entourage answered 13/4, 2014 at 3:24 Comment(3)
You might be interested in the source code.Carlisle
If you want it in sorted order, obviously that's not going to happen in linear time. Otherwise, you could call nlargest with t=n to comparison sort a list in linear time. If you just want the t largest elements in any order, that can be done in O(n) with quickselect. heapq.nlargest doesn't use quickselect, though; it gives the items in sorted order with a heap-based algorithm.Skaggs
Just a general note: the claim that it takes O(t + n) time itself strikes me as suspect, because that is just O(n) (since t ≤ n). It's not technically incorrect, but it's a strange way to express it.Nitrous

The speaker is wrong in this case. The actual cost is O(n * log(t)). Heapify is called only on the first t elements of the iterable. That's O(t), but is insignificant if t is much smaller than n. Then all the remaining elements are added to this "little heap" via heappushpop, one at a time. That takes O(log(t)) time per invocation of heappushpop. The length of the heap remains t throughout. At the very end, the heap is sorted, which costs O(t * log(t)), but that's also insignificant if t is much smaller than n.
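
To make the algorithm concrete, here is a minimal sketch of that strategy (a simplification: no key= support, and it unconditionally calls heappushpop, whereas the real implementation first peeks at the heap's minimum to skip pointless pushes):

    import heapq

    def nlargest_sketch(t, iterable):
        it = iter(iterable)
        heap = [x for _, x in zip(range(t), it)]  # first t elements
        heapq.heapify(heap)                       # O(t)
        for x in it:                              # n - t remaining elements
            heapq.heappushpop(heap, x)            # O(log t) each
        heap.sort(reverse=True)                   # O(t log t)
        return heap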

Fun with Theory ;-)

There are reasonably easy ways to find the t'th-largest element in expected O(n) time; for example, see here. There are harder ways to do it in worst-case O(n) time. Then, in another pass over the input, you could output the t elements >= the t-th largest (with tedious complications in case of duplicates). So the whole job can be done in O(n) time.
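
For the curious, here is a hedged sketch of one such expected-O(n) approach: quickselect-style partitioning around a random pivot, returning the t largest in arbitrary order. This is an illustration, not what heapq does; the `equal` bucket handles the duplicate complications mentioned above.

    import random

    def t_largest(xs, t):
        # Expected O(n): each recursion works on an expected-constant
        # fraction of the input, so the total work is a geometric series.
        xs = list(xs)
        if t <= 0:
            return []
        if t >= len(xs):
            return xs
        pivot = random.choice(xs)
        greater = [x for x in xs if x > pivot]
        if t <= len(greater):
            return t_largest(greater, t)
        equal = [x for x in xs if x == pivot]
        needed = t - len(greater)
        if needed <= len(equal):
            return greater + equal[:needed]
        smaller = [x for x in xs if x < pivot]
        return greater + equal + t_largest(smaller, needed - len(equal))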

But those ways require O(n) memory too. Python doesn't use them. An advantage of what's actually implemented is that the worst-case "extra" memory burden is O(t), and that can be very significant when the input is, for example, a generator producing a great many values.
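
As a concrete illustration of that trade-off (toy numbers chosen arbitrarily), the generator below produces a million values, yet nlargest only ever holds a 10-element list:

    import heapq

    # One pass over a million lazily produced values; only 10 are ever stored.
    gen = (x * 2654435761 % 1_000_000_007 for x in range(1_000_000))
    print(heapq.nlargest(10, gen))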

Conservatoire answered 13/4, 2014 at 3:35 Comment(7)
Great, that makes sense; I was really hoping O(t + n) was right though, since I thought I'd learn about some new heap wizardry :)Entourage
See the edit just now for an O(n) method - but it has nothing to do with heaps, alas.Conservatoire
Fun fact: You can in fact heapify the array in O(n) and fetch the top-k of the resulting heap in O(k) time per query. It's highly non-trivial though and the heapq module does not implement it. (It also probably has gigantic constant factors that make it infeasible in practice)Nitrous
@NiklasB. where can I read about this O(k) algorithm? Even if non-trivial I'm super interested!Entourage
@Entourage #22575080Nitrous
@NiklasB. my guess is that that approach requires O(n + t) space, versus O(t) for heapq; when t << n and the input comes from a generator, we save a lot of space for almost equal runtimes. Not sure if it makes sense to have a non-default option to skip the sort at the end when the caller doesn't care about the t elements being sorted.According
I think adding the remaining elements to the "little heap" costs O((n-t) log t), and the sorting at the very end takes O(t log t); neither is negligible in general, and they sum to O(n log t). Am I right?Ptarmigan

For heapq's t largest or t smallest, the time complexity is O(n log t).

heapq builds a heap from the first t elements, then iterates over the remaining elements, pushing each one onto the heap and popping the smallest off (so the heap always holds t elements).

  1. Building the heap from the first t elements takes O(t) (heapify builds a heap in linear time).
  2. Pushing and popping the remaining n - t elements takes O((n - t) log t).
  3. The overall time complexity is therefore O(n log t); see the sanity check below.
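
A quick sanity check of that behavior on random data (sizes chosen arbitrarily): the result must match fully sorting and slicing.

    import heapq
    import random

    xs = [random.randrange(10**9) for _ in range(10_000)]
    assert heapq.nlargest(10, xs) == sorted(xs, reverse=True)[:10]
    assert heapq.nsmallest(10, xs) == sorted(xs)[:10]
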
Fishy answered 24/10, 2021 at 0:18 Comment(0)

It's O(n + t⋅log n) if you heapify the entire input: heapify takes O(n), and extracting each largest/smallest element takes O(log n), so extracting t elements takes O(t⋅log n). Therefore the total is O(n + t⋅log n). Note that this describes a different algorithm from the one heapq.nlargest actually uses (see the sketch below and the accepted answer above).
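
For contrast, here is a minimal sketch of the algorithm this answer describes, i.e. heapify the whole input and pop t times (again, not what CPython's nlargest implements):

    import heapq

    def t_largest_full_heap(xs, t):
        # heapq provides a min-heap, so negate values to get a max-heap.
        heap = [-x for x in xs]   # O(n) copy
        heapq.heapify(heap)       # O(n)
        # t pops, O(log n) each, for O(t log n) total.
        return [-heapq.heappop(heap) for _ in range(min(t, len(heap)))]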

Diachronic answered 28/6, 2021 at 14:0 Comment(0)

According to the comments in the heapq source code:

# Algorithm notes for nlargest() and nsmallest()
# ==============================================
#
# Make a single pass over the data while keeping the k most extreme values
# in a heap.  Memory consumption is limited to keeping k values in a list.
#

# Theoretical number of comparisons for k smallest of n random inputs:
#
# Step   Comparisons                  Action
# ----   --------------------------   ---------------------------
#  1     1.66 * k                     heapify the first k-inputs
#  2     n - k                        compare remaining elements to top of heap
#  3     k * (1 + lg2(k)) * ln(n/k)   replace the topmost value on the heap
#  4     k * lg2(k) - (k/2)           final sort of the k most extreme values
#
# Combining and simplifying for a rough estimate gives:
#
#        comparisons = n + k * (log(k, 2) * log(n/k) + log(k, 2) + log(n/k))
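
Plugging hypothetical sizes into that rough estimate (n and k chosen arbitrarily) shows the total stays close to n when k is much smaller than n:

    from math import log, log2

    n, k = 1_000_000, 100
    comparisons = n + k * (log2(k) * log(n / k) + log2(k) + log(n / k))
    print(f"{comparisons:,.0f}")  # ~1,007,705 -- barely above n itself
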
Scop answered 12/6, 2022 at 13:36 Comment(0)
