Finding first samples greater than a threshold value efficiently in Python (and MATLAB comparison)
Instead of finding all the samples / data points within a list or an array which are greater than a particular threshold, I would like to find only the first samples where a signal becomes greater than a threshold. The signal might cross the threshold several times. For example, given the signal:

signal = [1, 2, 3, 4, 4, 3, 2, 1, 0, 3, 2, 1, 0, 0, 1, 1, 4, 8, 7, 6, 5, 0]

and a threshold = 2, then

signal = numpy.array(signal)
is_bigger_than_threshold = signal > threshold

would give me all values in signal which are greater than threshold. However, I would like to get only the first sample each time signal becomes greater than threshold. Therefore, I go through the whole list, making boolean comparisons like

first_bigger_than_threshold = list()
first_bigger_than_threshold.append(False)
for i in range(1, len(is_bigger_than_threshold)):
    # True only on a rising edge: above the threshold now, but not at the previous sample
    if not is_bigger_than_threshold[i]:
        val = False
    elif is_bigger_than_threshold[i - 1]:
        val = False
    else:
        val = True
    first_bigger_than_threshold.append(val)

This gives me the result I was looking for, namely

[False, False, True, False, False, False, False, False, False, True, False, False, False,   
False, False, False, True, False, False, False, False, False]

In MATLAB I would do it similarly:

for i = 2 : numel(is_bigger_than_threshold)
    if(is_bigger_than_threshold(i) == 0)
        val = 0;
    elseif(is_bigger_than_threshold(i))
        if(is_bigger_than_threshold(i - 1) == 0)
            val = 1;
        elseif(is_bigger_than_threshold(i - 1) == 1)
            val = 0;
        end
    end
    first_bigger_than_threshold(i) = val;
end % for

Is there a more efficient (faster) way to perform this calculation?

If I generate data in Python, e.g.

signal = [round(random.random() * 10) for i in range(0, 1000000)]

and time it, calculating these values takes 4.45 seconds. If I generate data in MATLAB

signal = round(rand(1, 1000000) * 10);

and execute the program, it takes only 0.92 seconds.

Why is MATLAB almost 5 times quicker than Python performing this task?
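For reference, here is one self-contained way to reproduce the Python loop timing (a sketch: the helper name loop_version is mine, and absolute times will vary by machine):

```python
import timeit

import numpy as np

def loop_version(sig, threshold):
    """Return a list marking the first sample of each run above threshold."""
    mask = sig > threshold
    out = [False]
    for i in range(1, len(mask)):
        # rising edge: above threshold now, not at the previous sample
        out.append(bool(mask[i] and not mask[i - 1]))
    return out

signal = np.round(np.random.rand(1_000_000) * 10)  # same data generation as above
elapsed = timeit.timeit(lambda: loop_version(signal, 2), number=1)
print(f"loop took {elapsed:.2f} s")
```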

Thanks in advance for your comments!

Disincentive answered 26/5, 2014 at 8:03
The other answers give you the positions of the first Trues; if you want a boolean array that marks each first True, you can do it faster:

import numpy as np

signal = np.random.rand(1000000)
th = signal > 0.5
# clear every True that directly follows another True, keeping only run starts
th[1:][th[:-1] & th[1:]] = False
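A quick sanity check of this trick on the example signal from the question (threshold 2), listing the True positions with np.flatnonzero:

```python
import numpy as np

signal = np.array([1, 2, 3, 4, 4, 3, 2, 1, 0, 3, 2, 1, 0, 0, 1, 1, 4, 8, 7, 6, 5, 0])
th = signal > 2
# the boolean mask th[:-1] & th[1:] is evaluated first, so the in-place
# assignment clears every True that follows another True
th[1:][th[:-1] & th[1:]] = False
print(np.flatnonzero(th))  # [ 2  9 16]
```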
Cracknel answered 27/5, 2014 at 6:27
This solution is really even faster than the other two options above. Thanks. - Disincentive
It is indeed faster - and it scales better too. I suspect on large arrays there are fewer cache misses. Nice solution! - Vidal
This post explains why your code is slower than Matlab.

Try this code:

import numpy as np

threshold = 2
signal = np.array([1, 2, 3, 4, 4, 3, 2, 1, 0, 3, 2, 1, 0, 0, 1, 1, 4, 8, 7, 6, 5, 0])

indices_bigger_than_threshold = np.where(signal > threshold)[0]  # indices above the threshold
print(indices_bigger_than_threshold)
# [ 2  3  4  5  9 16 17 18 19 20]
non_consecutive = np.where(np.diff(indices_bigger_than_threshold) != 1)[0] + 1  # +1 selects the next index
print(non_consecutive)
# [4 5]
first_bigger_than_threshold1 = np.zeros_like(signal, dtype=bool)
first_bigger_than_threshold1[indices_bigger_than_threshold[0]] = True  # retain the first
first_bigger_than_threshold1[indices_bigger_than_threshold[non_consecutive]] = True

np.where returns the indices that match the condition.

The strategy is to take the indices above the threshold and drop the consecutive ones, keeping only the first index of each run.
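Run on the example signal, the resulting mask indeed marks indices 2, 9 and 16 (the check below repeats the steps above and uses np.flatnonzero to list the True positions):

```python
import numpy as np

threshold = 2
signal = np.array([1, 2, 3, 4, 4, 3, 2, 1, 0, 3, 2, 1, 0, 0, 1, 1, 4, 8, 7, 6, 5, 0])

indices = np.where(signal > threshold)[0]
non_consecutive = np.where(np.diff(indices) != 1)[0] + 1  # first index of each later run
first = np.zeros_like(signal, dtype=bool)
first[indices[0]] = True                 # retain the very first index
first[indices[non_consecutive]] = True

print(np.flatnonzero(first))  # [ 2  9 16]
```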

BTW, welcome to Python/Numpy world.

Annular answered 26/5, 2014 at 8:37
Thank you for the link to the explanation about the JIT accelerator. That's good to know when using loops. Tested your version above, which works fine and took only 0.02 seconds on my machine. - Disincentive
Based on the notion that the best way to speed things up is to pick the best algorithm, you can do this neatly with a simple edge detector:

import numpy

threshold = 2
signal = numpy.array([1, 2, 3, 4, 4, 3, 2, 1, 0, 3, 2, 1, 0, 0, 1, 1, 4, 8, 7, 6, 5, 0])

thresholded_data = signal > threshold
# [1, -1] computes th[n] - th[n-1]: +1 on a rising edge, -1 on a falling edge
threshold_edges = numpy.convolve([1, -1], thresholded_data, mode='same')

thresholded_edge_indices = numpy.where(threshold_edges == 1)[0]

print(thresholded_edge_indices)

prints [2 9 16], the indices corresponding to the first entry of each sequence greater than the threshold. This will make things faster in both Matlab and Python (with Numpy): on my machine, about 12 ms to do what took you 4.5 s.

Edit: As pointed out by @eickenberg, the convolution can be replaced with numpy.diff(thresholded_data), which is conceptually a bit simpler, though in that case the indices will be out by 1, so remember to add those back in, and also to convert thresholded_data to be an array of ints with thresholded_data.astype(int). There is no appreciable speed difference between the two methods.
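A sketch of that numpy.diff variant on the same example data (note that a plain diff cannot flag index 0, so it would miss a run starting at the very first sample):

```python
import numpy as np

threshold = 2
signal = np.array([1, 2, 3, 4, 4, 3, 2, 1, 0, 3, 2, 1, 0, 0, 1, 1, 4, 8, 7, 6, 5, 0])

thresholded_data = (signal > threshold).astype(int)  # ints, so -1 edges don't collapse to True
# a 1 in the diff marks a rising edge one position before the
# first sample of the run, hence the + 1
rising_edges = np.where(np.diff(thresholded_data) == 1)[0] + 1
print(rising_edges)  # [ 2  9 16]
```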

Vidal answered 26/5, 2014 at 10:50
This looks like the way to go. I propose using numpy.diff for clarity and probably speed (edit proposed). - Kufic
It's actually slightly slower using numpy.diff (though doubtless not enough to worry about) (and it wasn't rejected by me!). - Vidal
@Henry Gomersall: Thank you for your suggestion with the edge detector. This works for me as well and yields a similar speed to mskimm's solution on my PC. - Disincentive
@Disincentive make sure you initialise your array as a numpy array - converting between lists will slow you down if you need to do this lots of times. - Vidal
Thanks for testing that! The fact that np.diff is not faster can be due to sample size. If it is always slower, that is really interesting, since it means np.convolve does its (somewhat more general) job really efficiently even in specific cases such as this one. If using np.diff, watch out that you use it on thresholded_data.astype(int), since otherwise all differences will just evaluate to True, including the -1 ones. - Kufic
@Kufic On rechecking, the timing difference is not substantial and is actually the other way around (11 ms for the diff vs 12 ms for convolve). I'll add a comment about astype. It's not wholly unexpected - a length 2 convolution is not exactly onerous! - Vidal
Is it possible to use this code to find the last instead of the first point? - Colloquium
@VictorAguiar sure, just change the comparison to -1 and subtract one from the indices (since -1 is the next value below the threshold). Note that this will ignore a run that never drops below the threshold at the end of a sequence, because it only detects transitions. - Vidal
@HenryGomersall thank you so much for your support! It helped a lot! - Colloquium