Efficient algorithm for evaluating a 1-d array of functions on a same-length 1d numpy array

Asked 11/2, 2015 at 16:51 Answered 11/2, 2015 at 21:0

Solved python performance numpy scientific-computing numba

I have a (large) length-N array of k distinct functions, and a length-N array of abcissa. I want to evaluate the functions at the abcissa to return a length-N array of ordinates, and critically, I need to do it very fast.

I have tried the following loop over a call to np.where, which is too slow:

Create some fake data to illustrate the problem:

def trivial_functional(i): return lambda x : i*x
k = 250
func_table = [trivial_functional(j) for j in range(k)]
func_table = np.array(func_table) # possibly unnecessary

We have a table of 250 distinct functions. Now I create a large array with many repeated entries of those functions, and a set of points of the same length at which these functions should be evaluated.

Npts = 1e6
abcissa_array = np.random.random(Npts)
function_indices = np.random.random_integers(0,len(func_table)-1,Npts)
func_array = func_table[function_indices]

Finally, loop over every function used by the data and evaluate it on the set of relevant points:

desired_output = np.zeros(Npts)
for func_index in set(function_indices):
    idx = np.where(function_indices==func_index)[0]
    desired_output[idx] = func_table[func_index](abcissa_array[idx])

This loop takes ~0.35 seconds on my laptop, the biggest bottleneck in my code by an order of magnitude.

Does anyone see how to avoid the blind lookup call to np.where? Is there a clever use of numba that can speed this loop up?

Cageling answered 11/2, 2015 at 16:51 Comment(8)

Might want to post this to codereview.stackexchange.com – Turbellarian 11/2, 2015 at 16:52

You will make it faster if you skip the call to np.where and use boolean indexing, i.e. idx = function_indices == func_index and everything else stays the same. – Fancied 11/2, 2015 at 17:6

@Fancied - this change actually slows the runtime to 0.5 seconds – Cageling 11/2, 2015 at 17:12

What's the time split between the 'where' line and last function evaluation line? – Merwyn 11/2, 2015 at 17:26

Are each of your functions limited to operating on only one abcissa point at a time? No possibility of taking in many at once? – Merwyn 11/2, 2015 at 17:28

@Merwyn - as written, the functions are taking in more than one abcissa point at once. In fact, each function takes in all the required points it calls at once, so this part of the calculation is already vectorized. Element-by-element evaluation is much slower than what I wrote. – Cageling 11/2, 2015 at 17:55

So it is the repeated call to where that is killing you. You need some sort of sort or groupby that can organize the indices once, and then give quick access in the loop. – Merwyn 11/2, 2015 at 18:25

Yes, @Merwyn , that's it! A groupby approach is just the ticket! I'll post my answer shortly, which beats the where loop by 20x. Many thanks for the idea! – Cageling 11/2, 2015 at 19:34

This does almost the same thing as your (excellent!) self-answer, but with a bit less rigamarole. It seems marginally faster on my machine as well -- about 30ms based on a cursory test.

def apply_indexed_fast(array, func_indices, func_table):
    func_argsort = func_indices.argsort()
    func_ranges = list(np.searchsorted(func_indices[func_argsort], range(len(func_table))))
    func_ranges.append(None)
    out = np.zeros_like(array)
    for f, start, end in zip(func_table, func_ranges, func_ranges[1:]):
        ix = func_argsort[start:end]
        out[ix] = f(array[ix])
    return out

Like yours, this splits a sequence of argsort indices into chunks, each of which corresponds to a function in func_table. It then uses each chunk to select input and output indices for its corresponding function. To determine the chunk boundaries, it uses np.searchsorted instead of np.unique -- where searchsorted(a, b) could be thought of as a binary search algorithm that returns the index of the first value in a equal to or greater than the given value or values in b.

Then the zip function simply iterates over its arguments in parallel, returning a single item from each one, collected together in a tuple, and stringing those together into a list. (So zip([1, 2, 3], ['a', 'b', 'c'], ['b', 'c', 'd']) returns [(1, 'a', 'b'), (2, 'b', 'c'), (3, 'c', 'd')].) This, along with the for statement's built-in ability to "unpack" those tuples, allows for a terse but expressive way to iterate over multiple sequences in parallel.

In this case, I've used it to iterate over the functions in func_tables alongside two out-of-sync copies of func_ranges. This ensures that the item from func_ranges in the end variable is always one step ahead of the item in the start variable. By appending None to func_ranges, I ensure that the final chunk is handled gracefully -- zip stops when any one of its arguments runs out of items, which cuts off the final value in the sequence. Conveniently, the None value also serves as an open-ended slice index!

Another trick that does the same thing requires a few more lines, but has lower memory overhead, especially when used with the itertools equivalent of zip, izip:

range_iter_a = iter(func_ranges)   # create generators that iterate over the 
range_iter_b = iter(func_ranges)   # values in `func_ranges` without making copies
next(range_iter_b, None)           # advance the second generator by one
for f, start, end in itertools.izip(func_table, range_iter_a, range_iter_b):
    ...

However, these low-overhead generator-based approaches can sometimes be a bit slower than vanilla lists. Also, note that in Python 3, zip behaves more like izip.

Dentistry answered 11/2, 2015 at 21:0 Comment(5)

Very nice. I can confirm your timings with my own independent tests: your solution is ~20% faster than mine, yours has more streamlined syntax, and our codes agree even on a few pernicious edge cases. This is great! – Cageling 11/2, 2015 at 23:15

@aph, glad to hear the tests agree -- I meant to run a few but had to go suddenly. I'll add a few words of explanation as well for future visitors. – Dentistry 11/2, 2015 at 23:34

it would be great if you wouldn't mind fleshing out a few things. Particularly your use of zip, which I didn't find intuitive, but which seems elegant, so it's syntax I'd like to learn on a practical application like this. – Cageling 12/2, 2015 at 1:16

Torn away again but just found time to post. Let me know if you have any more questions! – Dentistry 12/2, 2015 at 2:30

Well that is above and beyond the "call of duty", senderle. But, wow, a million thanks for taking the time. This is extremely instructive for me on several topics, including the bonus bit on itertools and generators. I've been writing my own homebrewed versions of groupby calculations for years (though not in python), and have never seen this calculation done so well before, so this is a real eye-opener of an SO answer for me. If I could give multiple up-votes, I would. Cheers! – Cageling 12/2, 2015 at 14:42

Thanks to hpaulj for the suggestion to pursue a groupby approach. There are lots of canned routines out there for this operation, such as Pandas DataFrames, but they all come with the overhead cost of the data structure initialization, which is one-time-only, but can be costly if using for just a single calculation.

Here is my pure numpy solution that is a factor of 13 faster than the original where loop I was using. The upshot summary is that I use np.argsort and np.unique together with some fancy indexing gymnastics.

First we sort the function indices, and then find the elements of the sorted array where each new index begins

idx_funcsort = np.argsort(function_indices)
unique_funcs, unique_func_indices = np.unique(function_indices[idx_funcsort], return_index=True)

Now there is no longer a need for blind lookups, since we know exactly which slice of the sorted array corresponds to each unique function. So we still loop over each called function, but without calling where:

for func_index in range(len(unique_funcs)-1):
    idx_func = idx_funcsort[unique_func_indices[func_index]:unique_func_indices[func_index+1]]
    func = func_table[unique_funcs[func_index]]
    desired_output[idx_func] = func(abcissa_array[idx_func])

That covers all but the final index, which somewhat annoyingly we need to call individually due to Python indexing conventions:

func_index = len(unique_funcs)-1
idx_func = idx_funcsort[unique_func_indices[func_index]:]
func = func_table[unique_funcs[func_index]]
desired_output[idx_func] = func(abcissa_array[idx_func])

This gives identical results to the where loop (a bookkeeping sanity check), but the runtime of this loop is 0.027 seconds, a speedup of 13x over my original calculation.

Cageling answered 11/2, 2015 at 19:47 Comment(1)

A small suggestion. x[a:] is equivalent to x[a:None] -- so instead of breaking out the last call, couldn't you just convert unique_func_indices to a plain list and append a None value? – Dentistry 11/2, 2015 at 20:4

That is a beautiful example of functional programming being somewhat emulated in Python.

Now, if you want to apply your function to a set of points, I'd recommend numpy's ufunc framework, which will allow you to create blazingly fast vectorized versions of your functions.

Soever answered 11/2, 2015 at 16:55 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags