Finding the median of an unsorted array
Asked Answered
S

9

71

To find the median of an unsorted array, we can make a min-heap in O(nlogn) time for n elements, and then we can extract one by one n/2 elements to get the median. But this approach would take O(nlogn) time.

Can we do the same by some method in O(n) time? If we can, then please tell or suggest some method.

Schuss answered 19/5, 2012 at 3:13 Comment(5)
possible duplicate of How to find the kth largest element in an unsorted array of length n in O(n)?Winniewinnifred
Keep in mind that if it takes O(nlogn) then you might as well just sort the array and divide the index by 2.Articulation
building heap takes O(n) time not O(nlogn)Arsphenamine
@JerryGoyal, If you have all elements at the same time, then building a heap takes O(n). But if you have stream of elements then, it takes O(nlogn). Its like pushing one element at time, and n times. So, I guess he means stream of elements here.Gerek
(@GorvGoyl: extract one by one n/2 elements takes O(nlogn) time.)Interoffice
G
52

You can use the Median of Medians algorithm to find median of an unsorted array in linear time.

Gratian answered 19/5, 2012 at 3:23 Comment(8)
@KevinKostlan It actually isn't approximate, it's the real median and it finds it in linear time. Notice that after finding the median of medians (which is guaranteed to be greater than at least 30% of the elements and smaller than at least 30% of the elements) you partition the the array using that pivot. Then you recurse (if necessary) into one of those array which is at most %70 the size of the original array in order to find the real median (or in the general case the k-statistic).Pro
@dcmm88: Please read [en.wikipedia.org/wiki/Median_of_medians]. In linear time, the best you can get is a good guess. (The moment you recurse you're no longer O(n)/linear - by definition.)Redtop
@Redtop the wikipedia page you linked specifically say that it is. en.wikipedia.org/wiki/…Pro
@Pro Read the first sentence of the article again. MoM is O(n) and approximate. When you prescribe recursive repetition of a linear operation over (subsets of) a dataset to obtain a "real median", you are specifying a new algorithm, with greater time complexity, by definition.Redtop
@Redtop excuse me, I misinterpreted the reply. I thought approximate was referring to the complexity, not the accuracy. Nevertheless, you can still use median of medians to find the true median in O(n), it's just that the wikipedia page doesn't explain this. I hinted at the solution in my previous reply, and you can find a more detailed explanation here, from https://mcmap.net/q/117100/-how-to-find-the-kth-largest-element-in-an-unsorted-array-of-length-n-in-o-n. Basically since you are recursing into a set that is 70% size of previous step, you get a geometric series that sums to some constant times O(n).Pro
In O(n) time, you will only find a near median, not actual median. [Edited by moderator to remove rude language]Dulci
Reading the full first paragraph of the Wikipedia link, the situation seems to be this: The Median of Medians algorithm (which finds the median of every group of 5, then recurses on those values) gives an approximate median in linear time, guaranteed to lie within the 30% and 70% percentiles. This can be used together with QuickSelect (like QuickSort but only recursively sorts the part holding the desired element) to find the exact median (at any desired percentile level) in linear time, even in the worst case. So this answer is sort of correct, but not quite the whole story.Jurado
@dcmm88: I got a surprise up-vote for my answer and was prompted to read the paper you linked in your comment above for the 1st time. Pg 3 starts "The total running time of finding Kth largest item in a list of N items is O(nlogn)" (after saying in the middle of pg 2 "Unfortunately... finding the median doesn’t seem... much simpler than finding the kth largest element"). The lesson I guess, is that voting is not objective, only reflects the wishes of people and cannot be trusted to always find the right answers. Your thoughts?Redtop
A
15

I have already upvoted the @dasblinkenlight answer since the Median of Medians algorithm in fact solves this problem in O(n) time. I only want to add that this problem could be solved in O(n) time by using heaps also. Building a heap could be done in O(n) time by using the bottom-up. Take a look to the following article for a detailed explanation Heap sort

Supposing that your array has N elements, you have to build two heaps: A MaxHeap that contains the first N/2 elements (or (N/2)+1 if N is odd) and a MinHeap that contains the remaining elements. If N is odd then your median is the maximum element of MaxHeap (O(1) by getting the max). If N is even, then your median is (MaxHeap.max()+MinHeap.min())/2 this takes O(1) also. Thus, the real cost of the whole operation is the heaps building operation which is O(n).

BTW this MaxHeap/MinHeap algorithm works also when you don't know the number of the array elements beforehand (if you have to resolve the same problem for a stream of integers for e.g). You can see more details about how to resolve this problem in the following article Median Of integer streams

Askins answered 11/4, 2015 at 12:52 Comment(3)
Why does this work? Suppose your array is [3, 2, 1]. We would then put the first 2 in a max heap: [3, 2], thus 3 would be the root, so that 2, its child must be smaller than it. And, we would have [1] in the min heap. According to this algorithm, we would then choose the max (root), of the maxHeap as our median. Wouldn't this give us 3?Mellott
It's O(n^2) time worse case, not O(n). When referring to an algorithm's Big O complexity, without specifying the case, it's typically assumed that you're referring to the worse time.Howes
Yeah the answer given is wrong, he said first n/2 elements needs to be added that's not true, in reality you have to add first n/2 (or n/2 +1 if n is odd) smallest element in Max heap and rest in Min heap hence it would ensure correct answer. Follow the link he provided below "Median of integer stream"Recycle
T
11

Quickselect works in O(n), this is also used in the partition step of Quicksort.

Tarkany answered 19/5, 2012 at 3:24 Comment(4)
I don't think quickselect would necessarily give the median in ONLY ONE run. It depends on your pivot choice.Bouillabaisse
Unfortunately, quickselect to find median will take O(n^2) in worst case. This occurs when we reduce the array by just 1 element in each iteration of QuickSelect. Consider an already sorted array and we always choose right most element as pivot. I know this is bit foolish to do so but this is how worst cases are.Actinal
@VishalSahu you are wrong. Quickselect runs in O(n), because it always chooses a good pivotOvi
Quickselect is between O(n) and O(n^2).Antiknock
P
10

The quick select algorithm can find the k-th smallest element of an array in linear (O(n)) running time. Here is an implementation in python:

import random

def partition(L, v):
    smaller = []
    bigger = []
    for val in L:
        if val < v: smaller += [val]
        if val > v: bigger += [val]
    return (smaller, [v], bigger)

def top_k(L, k):
    v = L[random.randrange(len(L))]
    (left, middle, right) = partition(L, v)
    # middle used below (in place of [v]) for clarity
    if len(left) == k:   return left
    if len(left)+1 == k: return left + middle
    if len(left) > k:    return top_k(left, k)
    return left + middle + top_k(right, k - len(left) - len(middle))

def median(L):
    n = len(L)
    l = top_k(L, n / 2 + 1)
    return max(l)
Platt answered 31/3, 2014 at 0:58 Comment(2)
How is this linear? If I understand correctly this implementation is O(n^2) in worst case.Embassy
@Embassy It's "expected value" linear time because of the randomness. The intuition is that the random index will, on average, split the list into a list of 1/4 size and of 3/4 size.Einstein
R
2

No, there is no O(n) algorithm for finding the median of an arbitrary, unsorted dataset.
At least none that I am aware of in 2022. All answers offered here are variations/combinations using heaps, Median of Medians, Quickselect, all of which are strictly O(nlogn).

  1. See https://en.wikipedia.org/wiki/Median_of_medians and http://cs.indstate.edu/~spitla/abstract2.pdf.

  2. The problem appears to be confusion about how algorithms are classified, which is according their limiting (worst case) behaviour. "On average" or "typically" O(n) with "worst case" O(f(n)) means (in textbook terms) "strictly O(f(n))". Quicksort for example, is often discussed as being O(nlogn) (which is how it typically performs), although it is in fact an O(n^2) algorithm because there is always some pathological ordering of inputs for which it can do no better than n^2 comparisons.

Redtop answered 24/4, 2019 at 15:26 Comment(3)
@Interoffice Thanks for interest. A couple of weeks ago I was coding up a kth largest routine while hobbying in a new language. Coincidentally this week got a brownie point for my answer here. I added a spontaneous comment, then checked my facts, then tweaked my answer (got a bit OCD;-). This is about algorithm classification and Wikipedia has it right in both cases - introspect O(nlogn) and Quickselect O(n^2) (clearer on non-mobile version of your linked pages, 1st link broken BTW). There is still no strictly linear solution (yet, AFAIK).Redtop
Well, en.wikipedia on quickselect variants states One can assure linear performance even in the worst case by using a more sophisticated pivot strategy; this is done in the median of medians algorithm. Too many disparate things seem to have been labelled introselect, I was thinking of the variant starting quickselect style, but monitoring partition performance and switching to median-of-medians pivot selection when partition looked bad, only.Interoffice
@Interoffice Adding an expensive-to-calculate pivot adds order(s) of magnitude to the average cost as a function of n (comparisons), making average/typical performance a LOT worse. Practically it is usually best to use a worst-case O(n^2) algorithm with fast pivot because the probability of encountering worst-case conditions is increasingly rare with larger datasets. Howzabout we let this rest with "there is no practical O(n) algorithm (yet) and the answers on this page are all strictly O(nlogn) by the definition of Big O notation en.wikipedia.org/wiki/Big_O_notation"Redtop
P
0

It can be done using Quickselect Algorithm in O(n), do refer to Kth order statistics (randomized algorithms).

Peria answered 19/5, 2012 at 8:12 Comment(0)
D
0

As wikipedia says, Median-of-Medians is theoretically o(N), but it is not used in practice because the overhead of finding "good" pivots makes it too slow.
http://en.wikipedia.org/wiki/Selection_algorithm

Here is Java source for a Quickselect algorithm to find the k'th element in an array:

/**
 * Returns position of k'th largest element of sub-list.
 * 
 * @param list list to search, whose sub-list may be shuffled before
 *            returning
 * @param lo first element of sub-list in list
 * @param hi just after last element of sub-list in list
 * @param k
 * @return position of k'th largest element of (possibly shuffled) sub-list.
 */
static int select(double[] list, int lo, int hi, int k) {
    int n = hi - lo;
    if (n < 2)
        return lo;

    double pivot = list[lo + (k * 7919) % n]; // Pick a random pivot

    // Triage list to [<pivot][=pivot][>pivot]
    int nLess = 0, nSame = 0, nMore = 0;
    int lo3 = lo;
    int hi3 = hi;
    while (lo3 < hi3) {
        double e = list[lo3];
        int cmp = compare(e, pivot);
        if (cmp < 0) {
            nLess++;
            lo3++;
        } else if (cmp > 0) {
            swap(list, lo3, --hi3);
            if (nSame > 0)
                swap(list, hi3, hi3 + nSame);
            nMore++;
        } else {
            nSame++;
            swap(list, lo3, --hi3);
        }
    }
    assert (nSame > 0);
    assert (nLess + nSame + nMore == n);
    assert (list[lo + nLess] == pivot);
    assert (list[hi - nMore - 1] == pivot);
    if (k >= n - nMore)
        return select(list, hi - nMore, hi, k - nLess - nSame);
    else if (k < nLess)
        return select(list, lo, lo + nLess, k);
    return lo + k;
}

I have not included the source of the compare and swap methods, so it's easy to change the code to work with Object[] instead of double[].

In practice, you can expect the above code to be o(N).

Deettadeeyn answered 31/12, 2014 at 23:9 Comment(0)
E
0

Let the problem be: finding the Kth largest element in an unsorted array.

Divide the array into n/5 groups where each group consisting of 5 elements.

Now a1,a2,a3....a(n/5) represent the medians of each group.

x = Median of the elements a1,a2,.....a(n/5).

Now if k<n/2 then we can remove the largets, 2nd largest and 3rd largest element of the groups whose median is greater than the x. We can now call the function again with 7n/10 elements and finding the kth largest value.

else if k>n/2 then we can remove the smallest ,2nd smallest and 3rd smallest element of the group whose median is smaller than the x. We can now call the function of again with 7n/10 elements and finding the (k-3n/10)th largest value.

Time Complexity Analysis: T(n) time complexity to find the kth largest in an array of size n.

T(n) = T(n/5) + T(7n/10) + O(n)

if you solve this you will find out that T(n) is actually O(n)

n/5 + 7n/10 = 9n/10 < n

Erector answered 17/10, 2020 at 7:30 Comment(1)
Median of medians does not give the median, only an approximate median (within ~30-70% I seem to recall). There is still no linear algorithm in existence AFAIK for finding the median of an arbitrary, unsorted dataset. Think about it - a dataset that contains values 1..25 has true median 13. If one of its MoM n/5 subsets is [13,14,15,16,17], the median of that subset is 15 (and 13 falls out of contention).Redtop
H
0

Notice that building a heap takes O(n) actually not O(nlogn), you can check this using amortized analysis or simply check in Youtube. Extract-Min takes O(logn), therefore, extracting n/2 will take (nlogn/2) = O(nlogn) amortized time.

About your question, you can simply check at Median of Medians.

Holdup answered 16/6, 2021 at 19:43 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.