Keeping track of the median of an expanding array
Asked Answered
K

5

15

Interview Question:

Edited Below You are given an array. You make 2 heaps out of it, one minheap and the other max heap. Now find the median of the array using these 2 provided heaps in O(nlog n) time.

Corrected Question Numbers are randomly generated and stored into an (expanding) array. How would you keep track of the median?

Solution This problem can be solved using 2 heaps and the median can always be accessed in O(1) time.

Kara answered 2/2, 2011 at 17:20 Comment(8)
I am guessing the actual question is to be able to determine median quickly even after many inserts and somehow got lost in translation. Frankly, this looks like a homework question disguised as an interview question.Descend
@Moron: Disagree with homework assessment. It's too easy to just copy the homework statement and not introduce "lost in translation" issues. This smells more like an interview question that got lost in translation.Canaan
@Jason: I am not saying EFreak has done this (or hasn't). It could be that the interviewer might have gone through a text book...Descend
If you are allowed to take O(N log N) time, why not just sort the array instead?Foote
Guys! I think I messed up with the question :P n @Moron: I don't go to school :) so no question of homework. I'm posting the corrected question now.Kara
@EFreak: So it was as I guessed! The interviewer seems to be going through homework problems :-)Descend
@Efreak: #2214207Descend
@Moron: That was indeed a nice find. Thanks!Kara
C
13

Here's how you use both heaps. Note that I'm assuming you don't know the number of elements, and this is why we must pop until we pop something from the min heap that is larger than or equal to what we pop from the max heap. Note that we return the average because in the case of a set like {1, 2, 3, 4} the median is actually 2.5 (the average of the two "middle" values). I'm assuming double as the value type, but this can obviously be anything. Here:

double min = minheap.pop();
double max = maxheap.pop();
while(min < max) {
    min = minheap.pop();
    max = maxheap.pop();
}

return (min + max) / 2;

Since popping is O(log n) and we have to pop O(n / 2) values, this is O(n log n).

Canaan answered 2/2, 2011 at 17:39 Comment(11)
@marcog: Did you even bother to think about it? Consider, e.g., {1, 2, 3}. Pop gives min as 1 and max as 3. Pop again gives min as 2 and max as 2. Condition min < max fails so we return (min + max) / 2 = 2. I challenge you to provide an example where this algorithm fails.Canaan
@Jason Consider 1, 2, 2. Pop gives min 1 max 2. Pop again gives min 2 max 1. Then you return 1.5 when the answer should be 2. (EDIT: I was wrong, the second pop gives min 2 max 2.)Vasty
@Jason I fail, sorry I just realised my mistake. Inverting vote. Apologies.Vasty
@marcog: No, you don't understand the algorithm. Pop gives min as 1 and max as 2. Condition min < max is okay, so proceed to next iteration. Pop gives min as 2 and max as 2 (yes, again, that's how a heap works). Condition min < max fails so return (min + max) / 2 = 2.Canaan
Yeah, so this is appropriate when you don't know N, but if you're given an array to begin with then obviously you do. Even then, two heaps is still overkill - you can find N by finding the depth of the tree O(log N) or, if the heap is stored as an array, N is the length of the array. I know this isn't answering the original question, so I'm guess I'm curious as to whether there's a case where using two heaps is the only, or the optimal, solution?Xl
@britishmutt: I think it's reasonable to assume that you don't know the number of elements otherwise there is no point to using two heaps (you can do it with one if you do know the number of elements). I suspect that something was lost in the translation from the interviewer to the interviewee to being posted on SO.Canaan
@Jason I get that. I still don't see the need for two heaps :)Xl
Alright, my final thought: maybe if all you are given is a well-encapsulated Heap object and the only method you have access to is getMin(), so there's no way to inspect the underlying data structure to find N. Then, I guess, two heaps is the only way. I'm going to go look at other questions now :)Xl
@Jason, if I'm correct your solution shows O(n log n) time complexity for a single median lookup, which btw is what you have by sorting the array and taking the middle element. The question asks for keeping the median of an expanding array, and I'm not sure I understand how your solution works there (or if it aimed to). Thx.Algetic
@britishmutt two heaps IMHO are needed if you have a stream of int in input and you want to keep the median updated. This will take O(n log n) time and O(n) space.Algetic
@Savino Sguera: Note that the question was edited over a month after my answer was provided.Canaan
A
6

A working implementation in java, using 2 heaps, O(n log n). At any time I keep the two heaps balanced in size (ie. they differ at most by 1, if we entered n elements such that n%2==1). Getting the median is O(1). Adding a new element is O(log n).

public class MedianOfStream {

    private int count;
    private PriorityQueue<Integer> highs, lows;

    public MedianOfStream() {
        highs = new PriorityQueue<Integer>(11, new Comparator<Integer>() {
            @Override
            public int compare(Integer arg0, Integer arg1) {
                return arg0.compareTo(arg1);
            }
        });
        lows = new PriorityQueue<Integer>(11, new Comparator<Integer>() {
            @Override
            public int compare(Integer arg0, Integer arg1) {
                return arg1.compareTo(arg0);
            }
        });
    }

    private int getMedian() {
        if (count == 0)
            return 0;
        if (lows.size() == highs.size()) {
            return (lows.peek() + highs.peek()) / 2;
        } else if (lows.size() < highs.size()) {
            return highs.peek();
        }
        return lows.peek();
    }

    private void swap(){
        int h = highs.poll();
        int l = lows.poll();
        highs.add(l);
        lows.add(h);
    }

    public int updateMedian(int n) {
        count++;

        if (count == 1)
            lows.add(n);

        else if (count==2) {
            highs.add(n);
            if(highs.peek()<lows.peek()) {
                swap(); // O(log n)
            }
        }

        else {
            if (n > highs.peek()) {
                lows.add(highs.poll()); // O(log n)
                highs.add(n); // O(log n)
            } else {
                highs.add(lows.poll()); // O(log n)
                lows.add(n); // O(log n)
            }
            if(highs.peek()<lows.peek()) {
                swap(); // O(log n)
            }
        }

        // if we added an even # of items,
        // the heaps must be exactly the same size,
        // otherwise we tolerate a 1-off difference

        if (Math.abs(lows.size() - highs.size()) > (count % 2)) {
            if (lows.size() < highs.size()) {
                lows.add(highs.poll()); // O(log n)
            } else {
                highs.add(lows.poll()); // O(log n)
            }
        }

        return getMedian(); // O(1)
    }
}
Algetic answered 25/1, 2012 at 0:11 Comment(0)
V
3

Popping from a heap is an O(log N) operation, so you can achieve O(N log N) by popping half the elements from one of the heaps and taking the last popped value (you'd have to handle edge cases). This doesn't take advantage of the other heap though.

You can achieve O(N) using the selection algorithm, but the constant factor is very high. The former suggestion is probably better if you already have a heap.

Vasty answered 2/2, 2011 at 17:27 Comment(19)
yeah .. the question was O(nlog n) itself.. posted it incorrectly.Kara
@Kara Thought so, answer updated. @Downvoter Please explainVasty
The only reason I can think to use two heaps is to cater for the edge case of N being even, where the median is then defined as the average of the two middle values which you'd get if you removed N/2 from both heaps. Seems like overkill for that purpose though.Xl
@britishmutt Exactly, that just doubles the runtime without really making the code any simpler.Vasty
Solving the problem in O(nlog n) isn't difficult. A simple quicksort and a fetch solves it. Problem is how do I use the 2 heaps to solve it?Kara
@EFreak: "simple" quicksort is O(n ²) worst case. Use mergesort.Connection
@Kara Why would you want to use two when one does the job?Vasty
@EFreak: Please see my solution: #4878104.Canaan
@marcog: Because typically in interview questions like this the point is to use what you are given.Canaan
@Jason No, you give the best answer. If your interviewer gives you a list of random numbers are you going to use it because he gave it to you?Vasty
@marcog: I'm going to do what the interviewer tells me. Here, according to @EFreak, find the median of the array using these 2 provided heaps in O(nlog n) time. There are many points to an interview question. One point is often to test your specific knowledge, often as a way of seeing how much knowledge you do possess. So here, for example, let's see if the interviewee knows about heaps. That will give me some insight into whether or not they know basic data structures and algorithms. A second point is to see if you're a team player, or can follow directions, etc.Canaan
@Jason And I did it using 2 heaps. I just ignored the second. If the interviewer is persistent about getting the worse answer, I'll consider that a sign to turn down any offer.Vasty
@marcog: Also, please note that we were not told that we know how many elements there are. If you don't know how many elements there are, you can't pop half of them. This is critical to realizing why you have to use both.Canaan
@Jason "You are given an array" <- More often than not, that comes with the size.Vasty
This is also (IMHO) a reason why asking vague interview questions on SO is a bad idea. You can't get clarification like you would in the interview.Vasty
@Larson: Thanks for educating me on that.Kara
@Marcog: May be the interviewer wants to check a no of points.. Even I couldn't find much use of 2 heaps. Thats the reason why I posted.Kara
@marcog: Yes, I agree with your comments about vague interview questions. I think it's fairly reasonable to assume, however, that you don't know the number of elements otherwise there is absolutely no point to having two heaps.Canaan
@Jason: There is a use of 2-heaps. If you allow future inserts, median finding can be made O(1) (inserts being O(log n)).Descend
O
0

Here's a python implementation for it. I tested some of the examples generated using Random.org and tested them on median finder most of them seem to work.

import heapq

def medianFinder(arr):
    minHeap = []
    maxHeap = []
    def handleMedianFinder(num: int):
        if not len(maxHeap):
            heapq.heappush(maxHeap, -num)
            return -maxHeap[0]
        elif not len(minHeap):
            heapq.heappush(minHeap, num)
            return (minHeap[0]-maxHeap[0])/2
        if num < minHeap[0]:
            if len(maxHeap) > len(minHeap):
                oldMedian = -heapq.heappop(maxHeap)
                heapq.heappush(minHeap, oldMedian)
                heapq.heappush(maxHeap, -num)
                return (minHeap[0]-maxHeap[0])/2 if (len(minHeap) + len(maxHeap))%2 == 0 else minHeap[0]
            heapq.heappush(maxHeap, -num)
        elif num > -maxHeap[0]:
            if len(maxHeap) < len(minHeap):
                oldMedian = heapq.heappop(minHeap)
                heapq.heappush(maxHeap, -oldMedian)
                heapq.heappush(minHeap, num)
                return (minHeap[0]-maxHeap[0])/2 if (len(minHeap) + len(maxHeap))%2 == 0 else -maxHeap[0]
            heapq.heappush(minHeap, num)
        else: # between the medians (new medians)
            if len(maxHeap) < len(minHeap):
                oldMedian = heapq.heappop(minHeap)
                heapq.heappush(maxHeap, -oldMedian)
                heapq.heappush(minHeap, num)
            else:
                oldMedian = -heapq.heappop(maxHeap)
                heapq.heappush(minHeap, oldMedian)
                heapq.heappush(maxHeap, -num)
        if len(minHeap) < len(maxHeap):
            return minHeap[0]
        elif len(maxHeap) < len(minHeap):
            return -maxHeap[0]
        return (minHeap[0]-maxHeap[0])/2
    for num in arr:
        # print('maxHeap => ', maxHeap)
        # print('minHeap => ', minHeap)
        print(handleMedianFinder(num))

        
arr1 = [11, 18, 16, 12, 14, 4, 15, 10, 19, 20]
medianFinder(arr1)
Ours answered 25/1 at 17:21 Comment(0)
B
-1

JavaScript solution using two heaps:

function addNewNumber(minHeap, maxHeap, randomNumber) {
  if (maxHeap.size() === minHeap.size()) {
    if (minHeap.peek() && randomNumber > minHeap.peek()) {
      maxHeap.insert(minHeap.remove());
      minHeap.insert(randomNumber);
    } else {
      maxHeap.insert(randomNumber);
    }
  } else {
    if (randomNumber < maxHeap.peek()) {
      minHeap.insert(maxHeap.remove());
      maxHeap.insert(randomNumber);
    } else {
      minHeap.insert(randomNumber);
    }
  }
}

function getMedian(minHeap, maxHeap) {
  if (!maxHeap.size()) {
    return 0;
  }
  if (minHeap.size() === maxHeap.size()) {
    return (minHeap.peek() + maxHeap.peek()) / 2;
  } else {
    return maxHeap.peek();
  }
}
Babirusa answered 24/11, 2014 at 3:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.