Counting consecutive 1's in NumPy array
Asked Answered
C

2

6
[1, 1, 1, 0, 0, 0, 1, 1, 0, 0]

I have a NumPy array consisting of 0's and 1's like above. How can I add all consecutive 1's like below? Any time I encounter a 0, I reset.

[1, 2, 3, 0, 0, 0, 1, 2, 0, 0]

I can do this using a for loop, but is there a vectorized solution using NumPy?

Coccyx answered 9/2, 2017 at 5:37 Comment(0)
C
9

Here's a vectorized approach -

def island_cumsum_vectorized(a):
    a_ext = np.concatenate(( [0], a, [0] ))
    idx = np.flatnonzero(a_ext[1:] != a_ext[:-1])
    a_ext[1:][idx[1::2]] = idx[::2] - idx[1::2]
    return a_ext.cumsum()[1:-1]

Sample run -

In [91]: a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])

In [92]: island_cumsum_vectorized(a)
Out[92]: array([1, 2, 3, 0, 0, 0, 1, 2, 0, 0])

In [93]: a = np.array([0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1])

In [94]: island_cumsum_vectorized(a)
Out[94]: array([0, 1, 2, 3, 4, 0, 0, 0, 1, 2, 0, 0, 1])

Runtime test

For the timings , I would use OP's sample input array and repeat/tile it and hopefully this should be a less opportunistic benchmark -

Small case :

In [16]: a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])

In [17]: a = np.tile(a,10)  # Repeat OP's data 10 times

# @Paul Panzer's solution
In [18]: %timeit np.concatenate([np.cumsum(c) if c[0] == 1 else c for c in np.split(a, 1 + np.where(np.diff(a))[0])])
10000 loops, best of 3: 73.4 µs per loop

In [19]: %timeit island_cumsum_vectorized(a)
100000 loops, best of 3: 8.65 µs per loop

Bigger case :

In [20]: a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])

In [21]: a = np.tile(a,1000)  # Repeat OP's data 1000 times

# @Paul Panzer's solution
In [22]: %timeit np.concatenate([np.cumsum(c) if c[0] == 1 else c for c in np.split(a, 1 + np.where(np.diff(a))[0])])
100 loops, best of 3: 6.52 ms per loop

In [23]: %timeit island_cumsum_vectorized(a)
10000 loops, best of 3: 49.7 µs per loop

Nah, I want really huge case :

In [24]: a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])

In [25]: a = np.tile(a,100000)  # Repeat OP's data 100000 times

# @Paul Panzer's solution
In [26]: %timeit np.concatenate([np.cumsum(c) if c[0] == 1 else c for c in np.split(a, 1 + np.where(np.diff(a))[0])])
1 loops, best of 3: 725 ms per loop

In [27]: %timeit island_cumsum_vectorized(a)
100 loops, best of 3: 7.28 ms per loop
Carrageen answered 9/2, 2017 at 6:22 Comment(20)
benchmark these: a1 = np.repeat(np.arange(1000)%2, np.random.randint(1, 1000, (1000,))); a2 = np.repeat(np.arange(100)%2, np.random.randint(1, 10000, (100,))); a3 = np.repeat(np.random.random((100,)) < 0.2, np.random.randint(1, 10000, (100,))) ;-)Lockwood
@PaulPanzer I decided to run all of these benchmarks myself on your suggested inputs. Divakar's implementation is still faster :P... by about 2x.Tucker
@rayreng Also on the last one? But I admit his is better; I'm just objecting to opportunistic benchmarks ;-)Lockwood
@PaulPanzer :D. I've already upvoted yours as it's very nice and in one line. The last one I just ran... it's just beaten by a slight margin.Tucker
@Tucker Strange, on my laptop (100 repeats) Divakar:0.47050797403790057;0.26334572909399867;0.3771821779664606 Paul Panzer:0.6545195570215583;0.4508163968566805;0.17192376195453107Lockwood
@PaulPanzer Added few less opportunistic benchmarks.Carrageen
I could start a big argument on how to scale a toy example, but who cares. You can have this one. Besides, I've already conceded yours is better. :-)Lockwood
@PaulPanzer You are looping, mine has some setup overhead. The only way a loopy one would be better would be if we are iterating less, i.e. either we are working with a very small array or we have less number of 1s islands. I tried your a3 and that's ~2x slower with the loopy one.Carrageen
@Carrageen that's interesting on my laptop the 'loopy one' is twice as fast, on your rig it's the other way round and for rayreng they are apparently almost tied. (I mean with a3). Any idea? Are you testing directly from the shell? I made a small file. Could the byte compilation make a difference?Lockwood
As I said earlier, if you feed very less number of 1s islands, a loopy one would be comparable or even faster. Use your a3 and then count the number of islands with len([c for c in np.split(a, 1 + np.where(np.diff(a))[0])])//2 and then compare this number with the length of array with a.size. You would see its a very very small number of islands. For the sample, OP had the no. of island as 2 for an array length of 10. Don't want to make this a research project, as my point about when to use a vectorized solution stays the same :)Carrageen
@Carrageen yes, I understand that which is why I designed the example in the first place :-)) What puzzles me are the differences between your and rayreng's and my benchmarking on the same type of sample. The ratio apparently varies by a factor of 4...Lockwood
@PaulPanzer Use np.random.random((10,)) < 0.2. you would have more erratic results across different runs. Use np.random.random((100,)) < 0.2 for less erratic runs. Use np.random.random((1000,)) < 0.2 and so on for lesser erratic ones, if you see the pattern there.Carrageen
@Carrageen Yes, but I can run it 10 times and I do see some variation, but nowhere near enough to explain the differences.Lockwood
@PaulPanzer Maybe you need to run it more number of times. Also, you need to wake up rayreng and ask him to do so too and finally ask me to do the same. Maybe we can sit together and discuss this together with politics sometime :)Carrageen
@Carrageen I give up :-)Lockwood
@PaulPanzer Offer a bounty and we might have an answer on that! One day a bounty might solve politics issues too, hopefully :)Carrageen
@Carrageen Cursed be the devil who invented this rep system! So addictive. I'm spending way too much time here.Lockwood
@PaulPanzer haha yeahh! Tell me about it!Carrageen
Would you have any idea how to do this for a 2d-array, counting on axis 1?Mullock
@Mullock Would need considerable modifications to accommodate for 2D arrays I think. Post a new question?Carrageen
L
2

If a list comprehension is acceptable

np.concatenate([np.cumsum(c) if c[0] == 1 else c for c in np.split(a, 1 + np.where(np.diff(a))[0])])
Lockwood answered 9/2, 2017 at 5:57 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.