How to calculate a running median efficiently
I borrowed some code to implement a function that calculates the running median for a large amount of data. The current version is too slow for me (the tricky part is that I need to exclude all zeros from the running box). Below is the code:

import numpy as np
from itertools import islice
from collections import deque
from bisect import bisect_left, insort

def median(s):
    # Median of the current window, ignoring zeros; returns 0 for an all-zero window
    sp = [nz for nz in s if nz != 0]
    # print(sp)  # debug output from the original; printing every window is itself slow
    if len(sp) == 0:
        return 0
    else:
        return np.median(sp)

def RunningMedian(seq, M):
    seq = iter(seq)

    # Set up list s (to be sorted) and load deque with first window of seq
    s = [item for item in islice(seq,M)]
    d = deque(s)

    # Sort it in increasing order and extract the median ("center" of the sorted window)
    s.sort()
    medians = [median(s)]
    for item in seq:
        old = d.popleft()          # pop oldest from left
        d.append(item)             # push newest in from right
        del s[bisect_left(s, old)] # locate insertion point and then remove old 
        insort(s, item)            # insert newest such that new sort is not required        
        medians.append(median(s))
    return medians

It works well; the only drawback is that it is too slow. Could anyone help me make the code more efficient?
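For reference, here is the behaviour I need on a small made-up example (window of 3, zeros ignored inside each window):

print(RunningMedian([1, 0, 3, 5, 0, 2], 3))
# medians of the non-zero values in each window: 2.0, 4.0, 4.0, 3.5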

After exploring the possibilities, I found that the following simple code computes it reasonably efficiently:

import numpy as np

def RunningMedian(x, N):
    # Index matrix whose rows are the sliding windows of length N
    idx = np.arange(N) + np.arange(len(x) - N + 1)[:, None]
    # Keep only the non-zero entries of each window
    b = [row[row > 0] for row in x[idx]]
    # The list comprehension works under both Python 2 and 3;
    # np.array(map(np.median, b)) only worked under Python 2
    return np.array([np.median(c) for c in b])
Tadtada answered 7/6, 2016 at 5:47 (11 comments)
Possible duplicate of Rolling median algorithm in C. - Abacist
You can attract more answers if you move this question to Code Review Stack Exchange. - Lisandralisbeth
Didn't Running or sliding median, mean and standard deviation work for you, or was it too slow for your needs? - Faery
@Faery Thank you very much. It works, but it is still too slow for a huge dataset. - Tadtada
@Faery I also want to exclude the zeros from the running box, and I did not know how to implement that from your answer. Do you have any ideas? Thank you. - Tadtada
What is too slow for you? How big is your data, and how long do you expect it to take? The median of a large dataset is quite an expensive calculation, and a rolling median even more so. - Jag
Take a look at this. It's a lot of code, and being written in Python it may not be all that fast, but the implementation is about as fast as it gets, and it uses the same approach that pandas does for its rolling median calculations (pandas implements it in Cython for speed). - Anacoluthon
Why not use the pandas.rolling_* methods? - Gunwale
@ImanolLuengo Each array has 300 elements, I want the window size to be 25, and there are 20 million such arrays to analyze. - Tadtada
@Anacoluthon I have seen some similar code, but it does not meet the requirement that the running box should exclude all zeros. This is the tricky part. - Tadtada
@Apero Again, pandas cannot handle a running box that excludes zeros. - Tadtada
One approach is below:

import numpy as np

def RunningMedian(x, N):
    # Index matrix whose rows are the sliding windows of length N
    idx = np.arange(N) + np.arange(len(x) - N + 1)[:, None]
    # Keep only the non-zero entries of each window
    b = [row[row > 0] for row in x[idx]]
    # The list comprehension works under both Python 2 and 3;
    # np.array(map(np.median, b)) only worked under Python 2
    return np.array([np.median(c) for c in b])
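For what it's worth, the same idea can also be written with NumPy's built-in sliding windows and a NaN-aware median, which avoids materialising the index matrix (a sketch assuming NumPy >= 1.20 for sliding_window_view; the function name is mine):

import numpy as np

def running_median_nonzero(x, N):
    # Build the sliding windows as a view instead of an index matrix
    windows = np.lib.stride_tricks.sliding_window_view(np.asarray(x, dtype=float), N)
    # Mark the entries to ignore (the zeros, as in row[row > 0] above) as NaN
    masked = np.where(windows > 0, windows, np.nan)
    # nanmedian skips the NaNs, i.e. the zeros, inside each window
    return np.nanmedian(masked, axis=1)

Note that an all-zero window comes out as NaN here (with a RuntimeWarning) rather than 0, so handle that case separately if it can occur.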

I later found a much faster one (tens of thousands of times faster), copied below:

from collections import deque
from bisect import insort, bisect_left
from itertools import islice

def running_median_insort(seq, window_size):
    """Contributed by Peter Otten"""
    seq = iter(seq)
    d = deque()
    s = []
    result = []
    # Grow the window up to window_size, keeping s sorted with insort
    for item in islice(seq, window_size):
        d.append(item)
        insort(s, item)
        result.append(s[len(d)//2])
    m = window_size // 2
    # Slide the window: drop the oldest value, insert the newest, read the middle
    for item in seq:
        old = d.popleft()
        d.append(item)
        del s[bisect_left(s, old)]
        insort(s, item)
        result.append(s[m])
    return result

Take a look at the link: running_median
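Note that the snippet above keeps zeros inside the window and, as the comments below point out, takes the upper of the two middle values when the window length is even. A possible adaptation that skips zeros (the requirement in the question) and averages the two middle non-zero values is sketched here; the function name and the choice to return 0 for an all-zero window are mine, mirroring the median() helper in the question:

from collections import deque
from bisect import insort, bisect_left

def running_median_insort_nonzero(seq, window_size):
    d = deque()   # raw window contents, zeros included, so they still age out
    s = []        # sorted non-zero values of the current window
    result = []
    for item in seq:
        d.append(item)
        if item != 0:
            insort(s, item)
        if len(d) < window_size:
            continue                      # window not full yet
        if len(d) > window_size:
            old = d.popleft()
            if old != 0:
                del s[bisect_left(s, old)]
        if not s:
            result.append(0)              # all-zero window
            continue
        n = len(s)
        if n % 2:
            result.append(s[n // 2])
        else:
            result.append((s[n // 2 - 1] + s[n // 2]) / 2)
    return result

For example, running_median_insort_nonzero([1, 0, 3, 5, 0, 2], 3) gives [2.0, 4.0, 4.0, 3.5], the same values as the np.median-based version.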

Tadtada answered 28/7, 2017 at 20:54 (2 comments)
The "much faster one" gives an incorrect median when the window length is even; the correct value is the mean of the two center elements. - Hayton
Try a simple case such as a step function with window size 3: both functions give incorrect results at the boundary of the step, and neither preserves phase. - Hayton
You can use two heaps to maintain the lower and higher halves of the data as you process it. The algorithm goes like this: for each value, push it onto the appropriate heap, then 'rebalance' the heaps so that their sizes differ by at most 1. To get the median, peek at the top of the larger heap, or average the tops of the two heaps if they are the same size. This runs in O(n log n) time overall (O(log n) per value).

from heapq import heappush, heappop


class RunningMedian:
    def __init__(self):
        # lowers holds the smaller half as a max-heap (values negated);
        # highers holds the larger half as a min-heap
        self.lowers, self.highers = [], []

    def add_number(self, number):
        if not self.highers or number > self.highers[0]:
            heappush(self.highers, number)
        else:
            heappush(self.lowers, -number)  # for lowers we need a max heap
        self.rebalance()

    def rebalance(self):
        # Keep the two heaps within one element of each other in size
        if len(self.lowers) - len(self.highers) > 1:
            heappush(self.highers, -heappop(self.lowers))
        elif len(self.highers) - len(self.lowers) > 1:
            heappush(self.lowers, -heappop(self.highers))

    def get_median(self):
        if len(self.lowers) == len(self.highers):
            return (-self.lowers[0] + self.highers[0])/2
        elif len(self.lowers) > len(self.highers):
            return -self.lowers[0]
        else:
            return self.highers[0]

Demo:

>>> running_median = RunningMedian()
>>> for n in (12, 4, 5, 3, 8, 7):
...     running_median.add_number(n)
...     print(running_median.get_median())
... 
12
8.0
5
4.5
5
6.0
Febrifugal answered 27/1, 2019 at 15:12 (2 comments)
What is "running" about this? The heap grows without bound. Normally you have a window over the samples to compute the running median, so you need both to keep the samples in the window ordered and to be able to pop samples in the order they were introduced. An efficient algorithm should reuse the ordering when adding/removing elements, which your implementation does, while still keeping the window size bounded, which it does not. - Goldshell
Looks like this is computing the running mean, not the median. - Enchase
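As the first comment notes, this version never evicts old values, so it gives a cumulative median rather than a windowed one. Two-heap variants are usually made windowed with lazy deletion, which gets fiddly; a simpler bounded-window sketch can instead be built on a sorted container (this assumes the third-party sortedcontainers package, and the function name is mine):

from collections import deque
from sortedcontainers import SortedList  # pip install sortedcontainers

def windowed_median(seq, window_size):
    window = deque()        # items in arrival order, so the oldest can be evicted
    ordered = SortedList()  # the same items kept sorted, O(log n) add/remove
    result = []
    for x in seq:
        window.append(x)
        ordered.add(x)
        if len(window) > window_size:
            ordered.remove(window.popleft())
        if len(window) == window_size:
            mid = window_size // 2
            if window_size % 2:
                result.append(ordered[mid])
            else:
                result.append((ordered[mid - 1] + ordered[mid]) / 2)
    return result

For example, windowed_median([12, 4, 5, 3, 8, 7], 3) gives [5, 4, 5, 7].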
There's a nice comparison of the speeds of different methods here, but in general scipy.ndimage.median_filter is a simple and efficient option. To also exclude the zeros, one could do:

from scipy.ndimage import median_filter

def RunningMedian(x, N):
    # Drop the zeros first, then apply a sliding median of size N to what is left
    return median_filter(x[x != 0], N)
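A small usage sketch (illustrative data; because the zeros are removed before filtering, the result has one value per non-zero element of x rather than per element of x):

import numpy as np
from scipy.ndimage import median_filter

x = np.array([1.0, 0.0, 3.0, 5.0, 0.0, 2.0])
print(RunningMedian(x, 3))   # 4 values, one per non-zero element of x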
Clayborne answered 6/4, 2022 at 10:18 (0 comments)
I faced this problem but not as a programmer. I wanted dedicated hardware to do a moving median of a sequence, with a fairly large moving window.

I did it with the two heaps as suggested above, but I also noticed that all the data movement and comparisons within a heap can be done in parallel. So the time required to insert a datum into either heap or to move a datum from either heap was O(1), a constant.

Monophagous answered 7/2, 2022 at 19:25 (0 comments)
