Lower bound on heapsort?
Asked Answered
R

3

18

It's well-known that the worst-case runtime for heapsort is Ω(n lg n), but I'm having trouble seeing why this is. In particular, the first step of heapsort (making a max-heap) takes time Θ(n). This is then followed by n heap deletions. I understand why each heap deletion takes time O(lg n); rebalancing the heap involves a bubble-down operation that takes time O(h) in the height of the heap, and h = O(lg n). However, what I don't see is why this second step should take Ω(n lg n). It seems like any individual heap dequeue wouldn't necessarily cause the node moved to the top to bubble all the way down the tree.

My question is - does anyone know of a good lower-bound proof for the best-case behavior of heapsort?

Reprint answered 4/1, 2011 at 1:42 Comment(4)
Would a best-case be an already sorted list? :)Khalkha
Probably better-suited on cstheory.stackexchange.comParallelepiped
@marcog Not cstheory for basic DS/under-grad questions :-/Cynthea
The basic idea is that the heap's size doubles in each level, so at least half of its elements are in the lowest level.(and 3/4 in the lowest two) Therefore, it is very likely that all "bubbles" need to go very far down in the heap.Annice
R
20

So I did a bit of digging myself and it looks like this result actually is fairly recent! The first lower-bound proof I can find is from 1992, though heapsort itself was invented in 1964.

The formal lower-bound proof is due to Schaffer and Sedgewick's "The Analysis of Heapsort" paper. Here's a slightly paraphrased version of the proof that omits some of the technical details.

To begin, let's suppose that n = 2k - 1 for some k, which guarantees that we have a complete binary heap. I'll show how to handle this case separately later on. Because we have 2k - 1 elements, the first pass of heapsort will, in Θ(n), build up a heap of height k. Now, consider the first half of the dequeues from this heap, which removes 2k-1 nodes from the heap. The first key observation is that if you take the starting heap and then mark all of the nodes here that actually end up getting dequeued, they form a subtree of the heap (i.e. every node that get dequeued has a parent that also gets dequeued). You can see this because if this weren't the case, then there would be some node whose (larger) parent didn't get dequeued though the node itself was dequeued, meaning that the values are out of order.

Now, consider how the nodes of this tree are distributed across the heap. If you label the levels of the heap 0, 1, 2, ..., k - 1, then there will be some number of these nodes in levels 0, 1, 2, ..., k - 2 (that is, everything except the bottom level of the tree). In order for these nodes to get dequeued from the heap, then they have to get swapped up to the root, and they only get swapped up one level at a time. This means that one way to lower-bound the runtime of heapsort would be to count the number of swaps necessary to bring all of these values up to the root. In fact, that's exactly what we're going to do.

The first question we need to answer is - how many of the largest 2k-1 nodes are not in the bottom level of the heap? We can show that this is no greater than 2k-2 by contradiction. Suppose that there are at least 2k-2 + 1 of the largest nodes in the bottom level of the heap. Then each of the parents of those nodes must also be large nodes in level k - 2. Even in the best case, this means that there must be at least 2k-3 + 1 large nodes in level k - 2, which then means that there would be at least 2k-4 + 1 large nodes in level k - 3, etc. Summing up over all of these nodes, we get that there are 2k-2 + 2k-3 + 2k-4 + ... + 20 + k large nodes. But this value is strictly greater than 2k-1, contradicting the fact that we're working with only 2k-1 nodes here.

Okay... we now know that there are at most 2k-2 large nodes in the bottom layer. This means that there must be at least 2k-2 of the large nodes in the first k-2 layers. We now ask - what is the sum, over all of these nodes, of the distance from that node to the root? Well, if we have 2k-2 nodes positioned somewhere in a complete heap, then at most 2k-3 of them can be in the first k - 3 levels, and so there are at least 2k-2 - 2k-3 = 2k-3 heavy nodes in level k - 2. Consequently, the total number of swaps that need to be performed are at least (k - 2) 2k-3. Since n = 2k-1, k = Θ(lg n), and so this value is Θ(n lg n) as required.

Reprint answered 4/1, 2011 at 21:3 Comment(4)
Should the first question of the second last paragraph be - how many of the largest 2^(k-1) nodes that are in the bottom level of the heap? Or there seemed to be some thing wrong in your suppose sentence for contradiction.Verbenaceous
In the last paragraph, shouldn't "the first k-2 layers" actually be "the first k-1 layers"? Also, the sentence, also in the last paragraph, "then at most 2^(k-3) of them can be in the first k-3 levels" I'm completely confused by. Does this sentence come from repeating the proof in second-to-last paragraph? If so, the sentence should actually read "then at least 2^(k-3) of them can be in the first k-2 levels". But if so, then this entire answer falls apart?Popliteal
I believe you meant to say that, since the first k-3 levels contain exactly 2^(k-3)-1 elements, the 2^(k-2) large elements above the bottom layer must have at least 2^(k-2)-2^(k-3)+1=2^(k-3)+1 located in either level k-2 or level k-3. Hence, each of 2^(k-3)+1 large elements require at least k-3 swaps to get to the root.Popliteal
This explanation is only swaps required for k-2 level. Should not we take sum for further inner levels?Flutter
A
3

Simple observation answer is this: The items in heap are:

1
2
4
8
...
2^[log(n/4)]
and last level has between (1..2^[log(n/2)]) ==> (1,[n/2]) item, (by [] I mean Ceiling not roof)

for example if you have 7 item:

1
2
4

and if you have 8 item:

1
2
4
1

There is 2 different heap tree, first at least n/4 - 1 items of a heap are in last level, or not, so there is at least n/4 - 1 item in level before last one, In the first case it takes O((n/4 - 1) * log(n/2)) to remove last level items from heap, and in the second case it takes O((n/4 - 1) * log(n/4)) to remove items from pre last level. so in both case it takes Ω(n log(n)) just for n/4 - 1 items, so it's a lower bound (easily can say it's tight lower bound).

Appreciation answered 4/1, 2011 at 7:58 Comment(0)
K
1

Here is a solution that uses CLRS terms:
We start with a max-heap that is a complete binary tree with n elements.
We can say that in a complete binary there are n/2 leaves and n/2 inner nodes.
n/2 iterations of HEAP-SORT remove the largest n/2 elements from the heap.
Let S be the set of the largest n/2 elements.
There can be at most n/4 elements from S in the leaves since there must be additional n/4 of them in the inner nodes.
Let L be these n/4 largest elements from S that are in the leaves.
So if there are n/4 elements from S at level 0 (the leaves level) then there must be at least n/8 of them at level 1.
Let P be these n/8 elements from S that are at level 1.
n/2 iterations of HEAP-SORT may give the elements from L a short cut to the root and then out of the heap, but the elements from P must make all the way to the root before they are removed from the heap.
So there are at least (n/8)(lgn-1) operations, which gives us a running time of Ω(nlgn).
Now for the case of a max-heap that doesn't have all its leaves at level 0.
Let k be the number of its leaves at level 0.
After k iterations of HEAP-SORT, we are left with a max-heap that is a complete binary tree with height lgn-1.
We can continue our proof the same way.
Now for the case when there are less than n/4 leaves from S.
Let k be the number of elements from S that are in the leaves at level 0.
If k <= n/8 then there must be at least n/8 elements from S at level 1.
This is because there can be a total of n/4 elements above level 1.
We continue the proof the same way.
If k>n/8 then there must be at least n/16 elements from S that are at level 1.
We continue the proof the same way.
We conclude that the running time of HEAP-SORT is Ω(nlgn).

Kowloon answered 8/2, 2012 at 15:24 Comment(1)
In this explanation we only consider the max elements in level 1. What about further inner levels? And there also will be contribution of n/4 for finding the max nodes which are there in the leaves right?Flutter

© 2022 - 2024 — McMap. All rights reserved.