Using groupby with expanding and a custom function
Asked Answered
B

1

7

I have a dataframe that consists of truthIds and trackIds:

truthId = ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'C', 'B', 'A', 'A', 'C', 'C']
trackId = [1, 1, 2, 2, 3, 4, 5, 3, 2, 1, 5, 4, 6]
df1 = pd.DataFrame({'truthId': truthId, 'trackId': trackId})
    trackId truthId
0         1       A
1         1       A
2         2       B
3         2       B
4         3       C
5         4       C
6         5       A
7         3       C
8         2       B
9         1       A
10        5       A
11        4       C
12        6       C

I wish to add a column that calculates, for each unique truthId, the length of the set of unique tracksIds that have previously (i.e. from the top of the data to that row) been associated with it:

       truthId  trackId  unique_Ids
0        A        1           1
1        A        1           1
2        B        2           1
3        B        2           1
4        C        3           1
5        C        4           2
6        A        5           2
7        C        3           2
8        B        2           1
9        A        1           2
10       A        5           2
11       C        4           2
12       C        6           3

I am very close to accomplishing this. I can use:

df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})

Which produces the following output:

                trackId
truthId            
A       0       1.0
        1       1.0
        6       2.0
        9       2.0
        10      2.0
B       2       1.0
        3       1.0
        8       1.0
C       4       1.0
        5       2.0
        7       2.0
        11      2.0
        12      3.0

This is consistent with the documentation

However, it throws an error when I attempt to assign this output to a new column:

df['unique_Ids'] = df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})

I have used this workflow before and ideally the new column is put back into the original DateFrame with no issues (i.e. Split-Apply-Combine). How can I get it to work?

Blocking answered 6/2, 2018 at 18:51 Comment(0)
E
7

You need reset_index

df['Your']=(df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})).reset_index(level=0,drop=True)
df
Out[1162]: 
    trackId truthId  Your
0         1       A   1.0
1         1       A   1.0
2         2       B   1.0
3         2       B   1.0
4         3       C   1.0
5         4       C   2.0
6         5       A   2.0
7         3       C   2.0
8         2       B   1.0
9         1       A   2.0
10        5       A   2.0
11        4       C   2.0
12        6       C   3.0
Evonevonne answered 6/2, 2018 at 19:4 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.