Cumsum reset at NaN

If I have a pandas.core.series.Series named ts of either 1's or NaN's like this:

3382   NaN
3381   NaN
...
3369   NaN
3368   NaN
...
15     1
10   NaN
11     1
12     1
13     1
9    NaN
8    NaN
7    NaN
6    NaN
3    NaN
4      1
5      1
2    NaN
1    NaN
0    NaN

I would like to calculate the cumsum of this series, but it should be reset (set to zero) at the locations of the NaNs, like below:

3382   0
3381   0
...
3369   0
3368   0
...
15     1
10     0
11     1
12     2
13     3
9      0
8      0
7      0
6      0
3      0
4      1
5      2
2      0
1      0
0      0

Ideally I would like to have a vectorized solution!

I have seen a similar question for Matlab: Matlab cumsum reset at NaN?

but I don't know how to translate this line: d = diff([0 c(n)]);

Dorie answered 12/8, 2013 at 21:14 Comment(0)

A simple Numpy translation of your Matlab code is this:

import numpy as np

v = np.array([1., 1., 1., np.nan, 1., 1., 1., 1., np.nan, 1.])
n = np.isnan(v)                              # boolean mask of the NaN positions
a = ~n                                       # True (counts as 1) where there is a value
c = np.cumsum(a)                             # running count of the non-NaN entries
d = np.diff(np.concatenate(([0.], c[n])))    # count accumulated since the previous NaN (or the start)
v[n] = -d                                    # write the negated offsets at the NaN positions
np.cumsum(v)

Executing this code returns the result array([ 1., 2., 3., 0., 1., 2., 3., 4., 0., 1.]). This solution will only be as valid as the original one, but maybe it will help you come up with something better if it isn't sufficient for your purposes.

Layby answered 12/8, 2013 at 21:46 Comment(2)
it wouldn't work with: v = np.array([1., 2., 4., np.nan, 1., 3., 1., 3., np.nan, 1.])Lacteal
If you change a = ~n to a = np.nan_to_num(v), it also works for v with values other than 1.Bollix
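
For reference, here is a sketch of the variant from the comment above (a = np.nan_to_num(v)), which also handles values other than 1:

import numpy as np

v = np.array([1., 2., 4., np.nan, 1., 3., 1., 3., np.nan, 1.])
n = np.isnan(v)
a = np.nan_to_num(v)                         # actual values, with NaN treated as 0
c = np.cumsum(a)                             # running sum ignoring the NaNs
d = np.diff(np.concatenate(([0.], c[n])))    # sum accumulated since the previous NaN (or the start)
v[n] = -d                                    # negate it at the reset positions
np.cumsum(v)                                 # array([1., 3., 7., 0., 1., 4., 5., 8., 0., 1.])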

An even more pandas-onic way to do it:

import numpy as np
import pandas as pd

v = pd.Series([1., 3., 1., np.nan, 1., 1., 1., 1., np.nan, 1.])
cumsum = v.cumsum().ffill()                      # running sum, carried forward across the NaNs
reset = -cumsum[v.isna()].diff().fillna(cumsum)  # offsets that bring the running total back to zero at each NaN
result = v.where(v.notna(), reset).cumsum()      # place the offsets at the NaN positions, then cumsum

Unlike the Matlab code, this also works for values other than 1.

Te answered 5/4, 2016 at 20:16 Comment(1)
This is the best answer of the lot. If you want to understand how it works, just add one more line at the end: print(pd.DataFrame({'v': v, 'cum': cumsum, 'reset': reset, 'result': result})), and run this code.Flan
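
Following that suggestion, a self-contained sketch (the result values shown are worked out by hand for the sample v above, so treat them as illustrative):

import numpy as np
import pandas as pd

v = pd.Series([1., 3., 1., np.nan, 1., 1., 1., 1., np.nan, 1.])
cumsum = v.cumsum().ffill()
reset = -cumsum[v.isna()].diff().fillna(cumsum)
result = v.where(v.notna(), reset).cumsum()
print(pd.DataFrame({'v': v, 'cum': cumsum, 'reset': reset, 'result': result}))
# result column: 1, 4, 5, 0, 1, 2, 3, 4, 0, 1 (reset offsets at the NaNs: -5 and -4)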

Here's a slightly more pandas-onic way to do it:

import numpy as np
from numpy import nan
from pandas import Series

v = Series([1, 1, 1, nan, 1, 1, 1, 1, nan, 1], dtype=float)
n = v.isnull()
a = ~n
c = a.cumsum()
index = c[n].index  # need the index for reconstruction after the np.diff
d = Series(np.diff(np.hstack(([0.], c[n]))), index=index)
v[n] = -d
result = v.cumsum()

Note that either of these requires that you're using pandas at least at commit 9da899b or newer. If you aren't, you can cast the bool dtype to an int64 or float64 dtype first:

v = Series([1, 1, 1, nan, 1, 1, 1, 1, nan, 1], dtype=float)
n = v.isnull()
a = ~n
c = a.astype(float).cumsum()
index = c[n].index  # need the index for reconstruction after the np.diff
d = Series(np.diff(np.hstack(([0.], c[n]))), index=index)
v[n] = -d
result = v.cumsum()
Kapor answered 12/8, 2013 at 21:54 Comment(9)
ValueError: cannot convert float NaN to integer for ts.notnull.cumsum() on pandas 0.12. I'm not sure why this would occur for a boolean series..Susiesuslik
That should've been fixed by 9da899bKapor
@Closed Make sure you're up to date and let me know if it still doesn't work.Kapor
@Closed I've updated my answer for usage pre 9da899b.Kapor
Thanks for your answer. @nosuchthingasstars 's answer is marked as solving this issue... but I also like your answer! You should write ts = pd.Series(np.random.randint(10, size=1000), dtype=float)Dorie
I think there is a mistake:
import pandas as pd
import numpy as np
#ts = pd.Series(np.random.randint(10, size=1000), dtype=float)
#ts[np.random.rand(len(ts)) > 0.5] = np.nan
ts = pd.Series([1, 1, 1, np.nan, np.nan, 1, 1])
valid = ts.notnull()
#c = valid.cumsum()  # with pd>=0.12
c = valid.astype(int).cumsum()  # because of Pandas bug with bool and cumsum for pd<0.12
d = c.diff().fillna(0)
ts[~valid] = d
because ts = 1 1 1 0 0 1 1 it should output 1 2 3 0 0 1 2Dorie
@Dorie I've updated my answer to give the correct answer.Kapor
Thanks but as you can see, it's very close from nosuchthingasstars solution.Dorie
Will that answer maintain the index?Kapor
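
For reference, a larger test series along the lines suggested in the comments above can be generated like this (a sketch; the size and NaN fraction are just the ones mentioned):

import numpy as np
import pandas as pd

ts = pd.Series(np.random.randint(10, size=1000), dtype=float)
ts[np.random.rand(len(ts)) > 0.5] = np.nan   # knock out roughly half the values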

If you can accept a similar boolean Series b, try

(b.cumsum() - b.cumsum().where(~b).fillna(method='pad').fillna(0)).astype(int)

Starting from your Series ts, either b = (ts == 1) or b = ~ts.isnull().
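
For instance, a short sketch applying this to a small sample ts, with b = ~ts.isnull() and .ffill() used as the modern spelling of fillna(method='pad'):

import numpy as np
import pandas as pd

ts = pd.Series([1, 1, 1, np.nan, np.nan, 1, 1])
b = ~ts.isnull()                              # True where a value is present
result = (b.cumsum() - b.cumsum().where(~b).ffill().fillna(0)).astype(int)
# result: 1, 2, 3, 0, 0, 1, 2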

Salado answered 10/8, 2015 at 6:16 Comment(0)

You can do that with expanding().apply() and replace(..., method='backfill'). Note that, as written, this resets at the value reset_at (0 here) rather than at NaN, so for the series in the question you would first map the NaNs onto that marker, e.g. ts = ts.fillna(reset_at):

reset_at = 0

ts.expanding().apply(
    lambda s:
        s[
            (s != reset_at).replace(True, method='backfill')
        ].sum()
).fillna(0)
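
For example, applied to a short version of the question's series after mapping the NaNs onto the reset marker (a sketch; it assumes a pandas version where replace(..., method='backfill') is still available):

import numpy as np
import pandas as pd

reset_at = 0
ts = pd.Series([1, 1, 1, np.nan, np.nan, 1, 1]).fillna(reset_at)  # NaNs become the reset marker

result = ts.expanding().apply(
    lambda s: s[(s != reset_at).replace(True, method='backfill')].sum()
).fillna(0)
# result: 1, 2, 3, 0, 0, 1, 2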
Confidant answered 25/3, 2022 at 16:3 Comment(0)

Another way is:

cond = ts.isnull()
groups = (cond != cond.shift()).cumsum()
result = ts.groupby(groups).apply(lambda x: x.cumsum()).fillna(0).convert_dtypes()
result.index = result.index.get_level_values(1)

or, more simply:

cond = ts.isnull()
groups = (cond != cond.shift()).cumsum()
result = ts.groupby(groups).agg('cumsum').fillna(0).convert_dtypes()

It works by first checking whether each value matches the group-delimiter condition (cond). It then compares this condition with the previous one to mark which element starts a new group (cond != cond.shift()); the cumsum then assigns a distinct label to each group.

After that it just groups by the group number and performs the desired operation (cumsum) inside each group. I like this way because it clearly separates the condition that delimits the groups from the operation to perform.
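
As a sketch of the same idea on a short sample series, using GroupBy.cumsum() directly (which should be equivalent to the agg('cumsum') call above):

import numpy as np
import pandas as pd

ts = pd.Series([1, 1, 1, np.nan, np.nan, 1, 1])
cond = ts.isnull()                         # True at the reset positions
groups = (cond != cond.shift()).cumsum()   # 1 1 1 2 2 3 3: one label per run
result = ts.groupby(groups).cumsum().fillna(0)
# result: 1, 2, 3, 0, 0, 1, 2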

Lockage answered 8/1 at 21:33 Comment(0)
