Fast alternative for numpy.median.reduceat

Related to this answer, is there a fast way to compute medians over an array that has groups with unequal numbers of elements?

E.g.:

data =  [1.00, 1.05, 1.30, 1.20, 1.06, 1.54, 1.33, 1.87, 1.67, ... ]
index = [0,    0,    1,    1,    1,    1,    2,    3,    3,    ... ]

I then want to compute the difference between each number and the median of its group (e.g. the median of group 0 is 1.025, so the first result is 1.00 - 1.025 = -0.025). For the array above, the results would be:

result = [-0.025, 0.025, 0.05, -0.05, -0.19, 0.29, 0.00, 0.10, -0.10, ...]
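
(For clarity, a slow reference implementation that defines the operation; just a sketch, not the fast solution I'm after:)

import numpy as np

# per-group medians, then subtract each element's group median
medians = {g: np.median([d for d, gi in zip(data, index) if gi == g]) for g in set(index)}
result = [d - medians[g] for d, g in zip(data, index)]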

Since np.median.reduceat doesn't exist (yet), is there another fast way to achieve this? My array will contain millions of rows so speed is crucial!

Indices can be assumed to be contiguous and ordered (it's easy to transform them if they aren't).
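
(For completeness, one way to do that transform, as a rough sketch; it assumes data and index are NumPy arrays:)

import numpy as np

order = np.argsort(index, kind='stable')          # group rows by label, keep within-group order
data, index = data[order], index[order]
index = np.unique(index, return_inverse=True)[1]  # relabel groups as contiguous 0, 1, 2, ...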


Example data for performance comparisons:

import numpy as np

np.random.seed(0)
rows = 10000
cols = 500
ngroup = 100

# Create random data and groups (unique per column)
data = np.random.rand(rows,cols)
groups = np.random.randint(ngroup, size=(rows,cols)) + 10*np.tile(np.arange(cols),(rows,1))

# Flatten
data = data.ravel()
groups = groups.ravel()

# Sort by group
idx_sort = groups.argsort()
data = data[idx_sort]
groups = groups[idx_sort]
Igor answered 10/11, 2019 at 11:9 Comment(4)
Did you time the scipy.ndimage.median suggestion in the linked answer? It doesn't seem to me that it needs an equal number of elements per label. Or did I miss something?Ruffin
So, when you said millions of rows, is your actual dataset a 2D array, and are you doing this operation on each of those rows?Lakitalaks
@Lakitalaks See edit to question for testing dataIgor
You already gave benchmark data in the initial question; I inflated it to keep the format the same. Everything is benchmarked against my inflated dataset; it's not reasonable to change it now.Phagocytosis

Sometimes you need to write non-idiomatic numpy code if you really want to speed up a calculation that you can't do with native numpy.

numba compiles your Python code to fast, low-level machine code (via LLVM). Since much of numpy itself is already about as fast as C, this mostly ends up being useful when your problem doesn't lend itself to native vectorization with numpy. This is one example (where I assume that the indices are contiguous and sorted, which is also reflected in the example data):

import numpy as np
import numba

# use the inflated example of roganjosh https://stackoverflow.com/a/58788534
data =  [1.00, 1.05, 1.30, 1.20, 1.06, 1.54, 1.33, 1.87, 1.67]
index = [0,    0,    1,    1,    1,    1,    2,    3,    3] 

data = np.array(data * 500) # using arrays is important for numba!
index = np.sort(np.random.randint(0, 30, 4500))               

# jit-decorate; original is available as .py_func attribute
@numba.njit('f8[:](f8[:], i8[:])') # explicit signature triggers eager (ahead-of-time) compilation
def diffmedian_jit(data, index): 
    res = np.empty_like(data) 
    i_start = 0 
    for i in range(1, index.size): 
        if index[i] == index[i_start]: 
            continue 

        # here: i is the first _next_ index 
        inds = slice(i_start, i)  # i_start:i slice 
        res[inds] = data[inds] - np.median(data[inds]) 

        i_start = i 

    # also fix last label 
    res[i_start:] = data[i_start:] - np.median(data[i_start:])

    return res

And here are some timings using IPython's %timeit magic:

>>> %timeit diffmedian_jit.py_func(data, index)  # non-jitted function
... %timeit diffmedian_jit(data, index)  # jitted function
...
4.27 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
65.2 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Using the updated example data in the question, these numbers (i.e. the runtime of the python function vs. the runtime of the JIT-accelerated function) are:

>>> %timeit diffmedian_jit.py_func(data, groups) 
... %timeit diffmedian_jit(data, groups)
2.45 s ± 34.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
93.6 ms ± 518 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This amounts to a 65x speedup in the smaller case and a 26x speedup in the larger case (compared to slow loopy code, of course). Another upside is that (unlike typical vectorization with native numpy) we didn't need additional memory to achieve this speed; it's all about optimized, compiled low-level code that ends up being run.


The above function assumes that numpy int arrays are int64 by default, which is not actually the case on Windows. An alternative is to remove the signature from the call to numba.njit, triggering proper just-in-time compilation. But this means the function will be compiled during its first execution, which can interfere with timing results (we can either execute the function once manually with representative data types, or simply accept that the first timed execution will be much slower and should be ignored). This is exactly what I tried to prevent by specifying a signature, which triggers eager (ahead-of-time) compilation.
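
A minimal sketch of that manual warm-up, assuming diffmedian_jit is decorated with a bare @numba.njit (no signature) and data/groups are the arrays from the question's benchmark setup:

_ = diffmedian_jit(data[:2], groups[:2])  # first call compiles for float64 data and the platform's default int dtype
# later calls (and timings) then run the already-compiled machine code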

Anyway, in the lazily compiled (proper just-in-time) case the decorator we need is just

@numba.njit
def diffmedian_jit(...):

Note that the timings I showed above for the jit-compiled function only apply once the function has been compiled. This either happens at definition (with eager compilation, when an explicit signature is passed to numba.njit) or during the first function call (with lazy compilation, when no signature is passed). If the function is only going to be executed once, then the compile time should also be counted towards the cost of this method. It's typically only worth compiling functions if the total time of compilation + execution is less than the uncompiled runtime (which is actually true in the above case, where the native python function is very slow). This mostly happens when you call your compiled function many times.

As max9111 noted in a comment, one important feature of numba is the cache keyword to jit. Passing cache=True to numba.jit will store the compiled function to disk, so that during the next execution of the given python module the function will be loaded from there rather than recompiled, which again can save you runtime in the long run.
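
That is, something like

@numba.njit(cache=True)  # cache the compiled machine code on disk between interpreter sessions
def diffmedian_jit(...):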

Proudman answered 10/11, 2019 at 12:26 Comment(9)
@Lakitalaks indeed, it assumes the indices are contiguous and sorted, which seemed like an assumption in OP's data, and also automatically included in roganjosh's index data. I'll leave a note about this, thanks :)Ruffin
OK, the contiguousness is not automatically included...but I'm pretty sure it has to be contiguous anyway. Hmm...Ruffin
Nevermind, it shouldn't be an issue if labels are missing in my version. Just have to sort.Ruffin
@AndrasDeak It's indeed fine to assume the labels are contiguous and sorted (fixing them if not is easy anyway)Igor
@AndrasDeak See edit to question for testing data (such that performance comparisons across questions are consistent)Igor
@AndrasDeak Will this approach be faster than Divakar's answer ?Igor
@Igor it probably will, but on my machine it takes 0.66 seconds to jit-compile the function. If you are going to execute it only once this is probably not the fastest solution. Then again, if you're doing this once and it takes ~1 second at worst to execute, you should pick the most readable solution (this calculation probably only becomes a performance bottleneck if you keep doing it a lot of times).Ruffin
You could mention the keyword cache=True to avoid recompilation on every restart of the interpreter.Agential
@Agential I've never used it but that's a very good point, thanks, added it to my answer.Ruffin

One approach would be to use Pandas here purely to make use of groupby. I've inflated the input sizes a bit to give a better understanding of the timings (since there is overhead in creating the DF).

import numpy as np
import pandas as pd

data =  [1.00, 1.05, 1.30, 1.20, 1.06, 1.54, 1.33, 1.87, 1.67]
index = [0,    0,    1,    1,    1,    1,    2,    3,    3]

data = data * 500
index = np.sort(np.random.randint(0, 30, 4500))

def df_approach(data, index):
    df = pd.DataFrame({'data': data, 'label': index})
    df['median'] = df.groupby('label')['data'].transform('median')
    df['result'] = df['data'] - df['median']
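
(The function above only builds the columns for timing purposes; for reference, the same computation can also hand back the result as a NumPy array, sketched here and not part of the timings below:)

s = pd.Series(data)
result = (s - s.groupby(index).transform('median')).to_numpy()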

Gives the following timeit:

%timeit df_approach(data, index)
5.38 ms ± 50.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For the same sample size, I get the dict approach of Aryerez to be:

%timeit dict_approach(data, index)
8.12 ms ± 3.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

However, if we increase the inputs by another factor of 10, the timings become:

%timeit df_approach(data, index)
7.72 ms ± 85 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit dict_approach(data, index)
30.2 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

However, at the expense of some readability, the answer by Divakar using pure numpy comes in at:

%timeit bin_median_subtract(data, index)
573 µs ± 7.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In light of the new dataset (which really should have been set at the start):

%timeit df_approach(data, groups)
472 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit bin_median_subtract(data, groups) #https://mcmap.net/q/922899/-fast-alternative-for-numpy-median-reduceat
3.02 s ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit dict_approach(data, groups) #https://mcmap.net/q/922899/-fast-alternative-for-numpy-median-reduceat
<I gave up after 1 minute>

# jitted (using @numba.njit('f8[:](f8[:], i4[:])') on Windows) from https://mcmap.net/q/922899/-fast-alternative-for-numpy-median-reduceat
%timeit diffmedian_jit(data, groups)
132 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Phagocytosis answered 10/11, 2019 at 12:12 Comment(5)
Thank you for this answer! For consistency with the other answers, could you test your solutions on the example data provided in the edit to my question?Igor
@Igor the timings are consistent already, no? They used my initial benchmark data, and in the cases that they didn't, I've provided the timings for them with the same benchmarkPhagocytosis
I overlooked that you had also added a reference to Divakar's answer, so your answer indeed already makes a nice comparison between the different approaches; thanks for that!Igor
@Igor I added the latest timings at the bottom since it actually changed things quite drasticallyPhagocytosis
Apologies for not adding the test set when posting the question, highly appreciated that you still added the test results now! Thanks!!!Igor

Maybe you already did this, but if not, see if this is fast enough:

median_dict = {i: np.median(data[index == i]) for i in np.unique(index)}  # per-group medians (data and index as numpy arrays)
def myFunc(my_dict, a):
    return my_dict[a]
vect_func = np.vectorize(myFunc)  # element-wise dict lookup (a Python-level loop under the hood)
median_diff = data - vect_func(median_dict, index)
median_diff

Output:

array([-0.025,  0.025,  0.05 , -0.05 , -0.19 ,  0.29 ,  0.   ,  0.1  ,
   -0.1  ])
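
(For reference, the same lookup can also be written without np.vectorize as a plain comprehension; a sketch that gives the same result:)

median_diff = data - np.array([median_dict[i] for i in index])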
Hircine answered 10/11, 2019 at 11:28 Comment(6)
At the risk of stating the obvious, np.vectorize is a very thin wrapper for a loop, so I wouldn't expect this approach to be particularly fast.Ruffin
@AndrasDeak I don't disagree :) I'll keep following, and if someone would post a better solution, I'll delete it.Hircine
I don't think you'd have to delete it even if faster approaches pop up :)Ruffin
@Phagocytosis That's probably because you didn't define data and index as np.arrays as in the question.Hircine
@Hircine Thank you for your answer; the documentation of numpy.vectorize indeed states that "The vectorize function is provided primarily for convenience, not for performance." Nevertheless, your answer is very informative for comparisons between alternatives. Could you provide a timeit of your answer given the new testing data that I just added to my question?Igor
@Igor roganjosh did a time comparison between my method and his, and others here compared theirs. It depends on the computer hardware, so there is no point in everyone checking their own methods, but it does appear that I came up with the slowest solution here.Hircine

Here's a NumPy-based approach to get the binned median for positive bins/index values -

def bin_median(a, i):
    # sort by group id (and by value within each group), so every group is a sorted block
    sidx = np.lexsort((a,i))

    a = a[sidx]
    i = i[sidx]

    # group sizes, skipping empty bins so they line up with the blocks
    c = np.bincount(i)
    c = c[c!=0]

    # index of the upper-middle element of each block in the flattened sorted array
    s1 = c//2

    e = c.cumsum()
    s1[1:] += e[:-1]

    firstval = a[s1-1]    # lower-middle element
    secondval = a[s1]     # upper-middle element
    # odd-sized group -> middle element; even-sized group -> mean of the two middle elements
    out = np.where(c%2,secondval,(firstval+secondval)/2.0)
    return out
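
For instance, on the small data/index example from the question this returns the four group medians (a quick sanity check; it assumes the inputs are passed as NumPy arrays):

a = np.array([1.00, 1.05, 1.30, 1.20, 1.06, 1.54, 1.33, 1.87, 1.67])
i = np.array([0, 0, 1, 1, 1, 1, 2, 3, 3])
bin_median(a, i)   # ~ array([1.025, 1.25, 1.33, 1.77]), up to floating-point rounding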

To solve our specific case of subtracting each group's median -

def bin_median_subtract(a, i):
    # indirect sort by (group, value); a itself stays in its original order
    sidx = np.lexsort((a,i))

    c = np.bincount(i)

    valid_mask = c!=0
    c = c[valid_mask]

    # positions (in sorted order) of the two middle elements of each group
    e = c.cumsum()
    s1 = c//2
    s1[1:] += e[:-1]
    ssidx = sidx.argsort()
    starts = c%2+s1-1    # lower middle (equals the upper middle for odd-sized groups)
    ends = s1            # upper middle

    # map those sorted positions back to indices into the original array
    starts_orgindx = sidx[np.searchsorted(sidx,starts,sorter=ssidx)]
    ends_orgindx  = sidx[np.searchsorted(sidx,ends,sorter=ssidx)]
    val = (a[starts_orgindx] + a[ends_orgindx])/2.    # per-group medians
    # labels form contiguous blocks, so repeat each median over its block and subtract
    out = a-np.repeat(val,c)
    return out
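
A quick check of the subtracted version against the expected result from the question, reusing a and i from the sanity check above (sorted, contiguous labels):

bin_median_subtract(a, i)
# ~ array([-0.025, 0.025, 0.05, -0.05, -0.19, 0.29, 0.0, 0.1, -0.1]), up to floating-point rounding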
Lakitalaks answered 10/11, 2019 at 12:25 Comment(3)
Very nice answer! Do you have any indication of the speed improvement over e.g. df.groupby('index').transform('median')?Igor
@Igor Can you test it out on your actual dataset of millions of rows?Lakitalaks
@Igor Edited my solution for a simpler one. Make sure to use this one for testing, if you are.Lakitalaks
