How to count the number of occurrences before a particular value in a dataframe in Python?

I have a dataframe like below:

A   B   C
1   1   1
2   0   1
3   0   0
4   1   0
5   0   1
6   0   1
7   1   0

I want to count the occurrences of zeros in df['B'] under the following condition:

if (df['B'] < df['C']):
    # count the number of zeros in df['B'] until it sees a 1

expected output:

A   B   C  output
1   1   1   NaN
2   0   1   1
3   0   0   NaN
4   1   0   NaN
5   0   1   1
6   0   1   0
7   1   0   NaN

I don't know how to formulate the count part. Any help is really appreciated.
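
For reference, the sample frame can be rebuilt like this (a minimal sketch; note that row A=6 uses C=1 so that it matches the expected output):

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7],
    'B': [1, 0, 0, 1, 0, 0, 1],
    'C': [1, 1, 0, 0, 1, 1, 0],
})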

Tubular answered 13/9, 2019 at 14:11

IIUC, one approach would be to use a custom grouper and aggregate with groupby.cumcount:

c1 = df.B.lt(df.C)       # mask: rows where B < C
g = df.B.eq(1).cumsum()  # group id: a new group starts at each 1 in B
# backward cumcount -> rows left until the next 1; shift/sub align it to the current row
df['out'] = c1.groupby(g).cumcount(ascending=False).shift().where(c1).sub(1)

print(df)

   A  B  C  out
0  1  1  1  NaN
1  2  0  1  1.0
2  3  0  0  NaN
3  4  1  0  NaN
4  5  0  1  1.0
5  6  0  1  0.0
6  7  1  0  NaN
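
To see what each step contributes, here is a small sketch that tabulates the intermediates (the column labels are mine, purely for illustration):

import pandas as pd

# assumes df, c1 and g from the snippet above
steps = pd.DataFrame({
    'B': df.B, 'C': df.C,
    'B<C': c1,                                        # the mask
    'group': g,                                       # one group per 1 in B
    'back': c1.groupby(g).cumcount(ascending=False),  # rows left until the next 1
})
print(steps)
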
Romaromagna answered 13/9, 2019 at 14:24

Using some masking and a groupby on your reversed series. This assumes binary data (only 0s and 1s).

m = df['B'][::-1].eq(0)  # reversed mask of the zeros in B
# label each run of equal values, then count zeros cumulatively within the run
d = m.groupby(m.ne(m.shift()).cumsum()).cumsum().sub(1)
d[::-1].where(df['B'] < df['C'])  # restore original order; keep only rows where B < C

0    NaN
1    1.0
2    NaN
3    NaN
4    1.0
5    0.0
6    NaN
Name: B, dtype: float64
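
The key idiom is m.ne(m.shift()).cumsum(), which assigns one id per run of equal values; a tiny illustration on a made-up series:

import pandas as pd

s = pd.Series([0, 0, 1, 0, 0, 0])
run_id = s.ne(s.shift()).cumsum()  # increments whenever the value changes
print(run_id.tolist())             # [1, 1, 2, 3, 3, 3] -> three runs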

And a fast numpy-based approach:

import numpy as np

def zero_until_one(a, b):
    n = a.shape[0]
    x = np.flatnonzero(a < b)   # positions where a < b
    y = np.flatnonzero(a == 1)  # positions of the 1s in a
    d = np.searchsorted(y, x)   # for each position in x, index of the next 1
    r = y[d] - x - 1            # zeros strictly between that position and the next 1
    out = np.full(n, np.nan)
    out[x] = r
    return out

zero_until_one(df['B'], df['C'])

array([nan,  1., nan, nan,  1.,  0., nan])
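
One caveat worth noting (not exercised by the sample data): if a row with a < b has no later 1 in a, searchsorted returns len(y) and y[d] raises an IndexError. A guarded variant might look like this (a sketch, not part of the original answer):

import numpy as np

def zero_until_one_safe(a, b):
    a, b = np.asarray(a), np.asarray(b)
    x = np.flatnonzero(a < b)
    y = np.flatnonzero(a == 1)
    d = np.searchsorted(y, x)
    ok = d < y.size                   # keep only positions that have a later 1
    out = np.full(a.shape[0], np.nan)
    out[x[ok]] = y[d[ok]] - x[ok] - 1
    return out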

Performance

df = pd.concat([df]*10_000)

%timeit chris1(df)
19.3 ms ± 348 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit yatu(df)
12.8 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit zero_until_one(df['B'], df['C'])
2.32 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
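
For context, chris1 and yatu here presumably wrap the groupby.cumcount answer above and the masking approach at the top of this answer; a sketch of what was likely timed:

def chris1(df):
    c1 = df.B.lt(df.C)
    g = df.B.eq(1).cumsum()
    return c1.groupby(g).cumcount(ascending=False).shift().where(c1).sub(1)

def yatu(df):
    m = df['B'][::-1].eq(0)
    d = m.groupby(m.ne(m.shift()).cumsum()).cumsum().sub(1)
    return d[::-1].where(df['B'] < df['C'])
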
Zebulon answered 13/9, 2019 at 14:21

Comment from Rigel: Great idea for the numpy function; just a guess, but numba may be faster.

Let us push it into one line:

df.groupby(df.B.iloc[::-1].cumsum()).cumcount(ascending=False).shift(-1).where(df.B<df.C)
Out[80]: 
0    NaN
1    1.0
2    NaN
3    NaN
4    1.0
5    0.0
6    NaN
dtype: float64
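
Spelled out (the variable names are mine), the reversed cumsum builds the same groups as B.eq(1).cumsum() in the answer above, just numbered from the other end:

g = df.B.iloc[::-1].cumsum()                   # group ids built right to left; each group ends at a 1
cnt = df.groupby(g).cumcount(ascending=False)  # rows after the current one within its group
out = cnt.shift(-1).where(df.B < df.C)         # next row's count = zeros before the closing 1
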
Rigel answered 13/9, 2019 at 15:11
