How to generate a sequence considering NaN in pandas

I have a series that contains NaN and True as values. I want to build a second series of numbers such that wherever the value is NaN the sequence is 0, and within each run of True rows between two NaNs the sequence counts up from 0 (a cumcount).

i.e.,

Input:

colA
NaN
True
True
True
True
NaN
True
NaN
NaN
True
True
True
True
True

Output:

colA    Sequence
NaN     0
True    0
True    1
True    2
True    3
NaN     0
True    0
NaN     0
NaN     0
True    0
True    1
True    2
True    3
True    4

How can I do this in pandas?
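
For reference, the example column can be reproduced with something like this (column name colA as in the tables above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [np.nan, True, True, True, True, np.nan, True,
                            np.nan, np.nan, True, True, True, True, True]})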

Wormseed answered 7/2, 2018 at 8:19 Comment(2)
And what have you tried thus far? – Bathsheb
groupby and fillna – Wormseed

You could use groupby + cumcount + mask here:

m = df.colA.isnull()
df['Sequence'] = df.groupby(m.cumsum()).cumcount().sub(1).mask(m, 0)
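
To see what the chain is doing, here is a rough sketch of the intermediate values for the example frame above (written out by hand, so treat it as illustrative):

m = df.colA.isnull()
# m                           : [T, F, F, F, F, T, F, T, T, F, F, F, F, F]
# m.cumsum()  (group labels)  : [1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 4]
# cumcount within each group  : [0, 1, 2, 3, 4, 0, 1, 0, 0, 1, 2, 3, 4, 5]
# .sub(1)                     : [-1, 0, 1, 2, 3, -1, 0, -1, -1, 0, 1, 2, 3, 4]
# .mask(m, 0)                 : [0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 4]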

Or, use clip_lower in the last step and you don't have to pre-cache m:

df['Sequence'] = df.groupby(df.colA.isnull().cumsum()).cumcount().sub(1).clip_lower(0)
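
Side note: clip_lower was deprecated in later pandas releases and removed in pandas 1.0, so on a current version the last step would use clip instead:

df['Sequence'] = df.groupby(df.colA.isnull().cumsum()).cumcount().sub(1).clip(lower=0)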

df

    colA  Sequence
0    NaN         0
1   True         0
2   True         1
3   True         2
4   True         3
5    NaN         0
6   True         0
7    NaN         0
8    NaN         0
9   True         0
10  True         1
11  True         2
12  True         3
13  True         4

Timings

df = pd.concat([df] * 10000, ignore_index=True)

# Timing the alternatives in this answer

%%timeit
m = df.colA.isnull()
df.groupby(m.cumsum()).cumcount().sub(1).mask(m, 0)

23.3 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
df.groupby(df.colA.isnull().cumsum()).cumcount().sub(1).clip_lower(0)

24.1 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @user2314737's solution

%%timeit
df.groupby((df['colA'] != df['colA'].shift(1)).cumsum()).cumcount()

29.8 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @jezrael's solution

%%timeit
a = df['colA'].isnull()
b = a.cumsum()
(b-b.where(~a).add(1).ffill().fillna(0).astype(int)).clip_lower(0)

11.5 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note, your mileage may vary, depending on the data.

Belicia answered 7/2, 2018 at 8:23 Comment(7)
Thanks for doing the timings--I was going to do it too but you were faster! – Psilocybin
@June Done :) – Belicia
@Psilocybin No problem, I like being thorough with my answers! – Belicia
@June No problem. Nice answer by the way ;) – Belicia
@June also better than mine, I get 800ms :| – Psilocybin
@cᴏʟᴅsᴘᴇᴇᴅ, i see my pure numpy solution coming in at 15ms, which is comparable to these. but i expected faster. – Alodium
@jp_data_analysis I mean, yeah, numpy is always going to be slightly faster than pandas solutions. But once you have the basic idea down, it's easy to translate one to the other for a performance boost. For example, jezrael's answer can be optimised a bit for better performance using numpy functions, etc. Yours is a nice answer still, upvoted it. – Belicia

If performance is important, it is better to avoid groupby and count the consecutive Trues directly:

a = df['colA'].notnull()
b = a.cumsum()
df['Sequence'] = (b-b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)
print (df)
    colA  Sequence
0    NaN         0
1   True         0
2   True         1
3   True         2
4   True         3
5    NaN         0
6   True         0
7    NaN         0
8    NaN         0
9   True         0
10  True         1
11  True         2
12  True         3
13  True         4

Explanation:

df = pd.DataFrame({'colA':[np.nan,True,True,True,True,np.nan,
                           True,np.nan,np.nan,True,True,True,True,True]})

a = df['colA'].notnull()
#cumulative sum, each True is counted as 1
b = a.cumsum()
#replace values where a is True with NaN
c = b.mask(a)
#add 1 so the count starts from 0
d = b.mask(a).add(1)
#forward fill NaNs, replace a possible leading NaN with 0 and cast to int
e = b.mask(a).add(1).ffill().fillna(0).astype(int)
#subtract from b to get the counts
f = b-b.mask(a).add(1).ffill().fillna(0).astype(int)
#set NaN rows (where a is False) to 0, which also removes the -1 values
g = (b-b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)

#all together
df = pd.concat([a,b,c,d,e,f,g], axis=1, keys=list('abcdefg'))
print (df)
        a   b    c    d  e  f  g
0   False   0  0.0  1.0  1 -1  0
1    True   1  NaN  NaN  1  0  0
2    True   2  NaN  NaN  1  1  1
3    True   3  NaN  NaN  1  2  2
4    True   4  NaN  NaN  1  3  3
5   False   4  4.0  5.0  5 -1  0
6    True   5  NaN  NaN  5  0  0
7   False   5  5.0  6.0  6 -1  0
8   False   5  5.0  6.0  6 -1  0
9    True   6  NaN  NaN  6  0  0
10   True   7  NaN  NaN  6  1  1
11   True   8  NaN  NaN  6  2  2
12   True   9  NaN  NaN  6  3  3
13   True  10  NaN  NaN  6  4  4
June answered 7/2, 2018 at 8:27 Comment(0)

Try this:

df['Sequence']=df.groupby((df['colA'] != df['colA'].shift(1)).cumsum()).cumcount()
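
This works because NaN compares unequal to everything, including another NaN, so every NaN row and the first True after it each start a new group, and cumcount restarts from 0. A rough hand-written sketch of the intermediate values on the example data (s is just a shorthand name for the grouping key):

s = df['colA'] != df['colA'].shift(1)
# s           : [T, T, F, F, F, T, T, T, T, T, F, F, F, F]
# s.cumsum()  : [1, 2, 2, 2, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7]
# cumcount    : [0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 4]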

Full example:

>>> df = pd.DataFrame({'colA':[np.NaN,True,True,True,True,np.NaN,True,np.NaN,np.NaN,True,True,True,True,True]})
>>> df['Sequence']=df.groupby((df['colA'] != df['colA'].shift(1)).cumsum()).cumcount()
>>> df
    colA  Sequence
0    NaN         0
1   True         0
2   True         1
3   True         2
4   True         3
5    NaN         0
6   True         0
7    NaN         0
8    NaN         0
9   True         0
10  True         1
11  True         2
12  True         3
13  True         4
Psilocybin answered 7/2, 2018 at 8:27 Comment(0)

Late to the party, but here's a numpy solution wrapped in a function:

import pandas as pd, numpy as np

df = pd.DataFrame({'ColA': [np.nan, True, True, True, True, np.nan, True,
                            np.nan, np.nan, True, True, True, True, True]})

def return_cumsum(df):
    # cast to float so NaN stays NaN and each True becomes 1.0
    v = np.array(df.ColA, dtype=float)
    n = np.isnan(v)
    # at each NaN position, write minus the length of the run of Trues since
    # the previous NaN, so the running sum resets to 0 there
    v[n] = -np.diff(np.concatenate(([0.], np.cumsum(~n)[n])))
    # cumulative sum minus 1 is the within-run count; maximum(0, ...) maps the
    # -1 on NaN rows back to 0
    df['Sequence'] = np.array(np.maximum(0, np.cumsum(v)-1), dtype=int)
    return df

df = return_cumsum(df)

#     ColA  Sequence
# 0    NaN         0
# 1   True         0
# 2   True         1
# 3   True         2
# 4   True         3
# 5    NaN         0
# 6   True         0
# 7    NaN         0
# 8    NaN         0
# 9   True         0
# 10  True         1
# 11  True         2
# 12  True         3
# 13  True         4
Alodium answered 8/2, 2018 at 0:36 Comment(0)
