How do I perform pandas cumsum while skipping rows that are duplicated in another field?
I am trying to use the pandas cumsum() function, but in a way that ignores rows whose value in the ID column is duplicated: when an ID repeats, only its most recent value should count toward the cumulative sum, and all earlier values for that ID should be ignored. Example code below (I couldn't share the real code, which is for work).

import pandas as pd
import numpy as np
id = ['a','b','c','a','b','e','f','a','b','k']
value = [12,14,3,13,16,7,4,6,10,18]

df = pd.DataFrame({'id':id, 'value':value})
df["cumsum_of_value"] = df['value'].cumsum()
df["desired_output"] = [
    12,26,29,30,32,39,43,36,30,48
]
df["comments"] = [
    "standard cumsum",
    "standard cumsum",
    "standard cumsum",
    "cumsum of rows 1-3, ignore row 0",
    "cumsum of rows 2-4, ignore rows 0, 1",
    "cumsum of rows 2-5, ignore rows 0, 1",
    "cumsum of rows 2-6, ignore rows 0, 1",
    "cumsum of rows 2,4-7, ignore rows 0, 1, 3",
    "cumsum of rows 2,5-8, ignore rows 0, 1, 3, 4",
    "cumsum of rows 2,5-9, ignore rows 0, 1, 3, 4",
]
print(df)

In this example, there are six (6) unique values in the ID column (a, b, c, e, f, k), so the cumsum should only ever sum a max of six (6) records as its output on any row.

Is this possible using combinations of functions such as cumsum(), groupby(), duplicated(), and drop_duplicates(), while avoiding an iterative loop?

I've tried the below:

df["duped"] = np.where(df["id"].duplicated(keep='last'),0,1)
df["value_duped"] = df["duped"] * df["value"]
df["desired_output_attempt"] = df["cumsum_of_value"] - df["value_duped"]

But it doesn't come close to the correct answer. I can't think of how to get something like this to result in the desired output without iterating.
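[Editor's note: as a quick reproduction, not part of the original post, running the attempt above on the toy frame shows how far it lands from the desired output. Subtracting only the final occurrence's value removes one term from each running total; it never replaces an earlier occurrence with a later one.]

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b", "c", "a", "b", "e", "f", "a", "b", "k"],
    "value": [12, 14, 3, 13, 16, 7, 4, 6, 10, 18],
})

# duplicated(keep='last') flags every occurrence except the last, so
# "duped" is 1 only on an id's final row (and on ids that never repeat).
df["duped"] = np.where(df["id"].duplicated(keep="last"), 0, 1)
df["value_duped"] = df["duped"] * df["value"]
df["attempt"] = df["value"].cumsum() - df["value_duped"]

print(df["attempt"].tolist())
# [12, 26, 26, 42, 58, 58, 65, 69, 75, 85]  -- desired was [12, 26, 29, 30, ...]
```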

Georgiageorgian answered 20/6 at 0:4

Try:

df["out"] = (
    df.groupby("id")["value"].transform("diff").fillna(df["value"]).cumsum().astype(int)
)

print(df)

Prints:

  id  value  cumsum_of_value  desired_output  out
0  a     12               12              12   12
1  b     14               26              26   26
2  c      3               29              29   29
3  a     13               42              30   30
4  b     16               58              32   32
5  e      7               65              39   39
6  f      4               69              43   43
7  a      6               75              36   36
8  b     10               85              30   30
9  k     18              103              48   48
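[Editor's sketch of why this works, using the same toy frame as the question: within each id group, diff() yields NaN at an id's first occurrence and value-minus-previous-occurrence otherwise. Filling the NaNs with the raw values means each row contributes exactly the amount needed to swap an id's old value for its new one in the running total.]

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b", "c", "a", "b", "e", "f", "a", "b", "k"],
    "value": [12, 14, 3, 13, 16, 7, 4, 6, 10, 18],
})

# Per-group difference: NaN for an id's first occurrence,
# value - previous occurrence otherwise.
step = df.groupby("id")["value"].transform("diff")
# First occurrences contribute their raw value.
step = step.fillna(df["value"])

# Row 3 (second 'a') contributes 13 - 12 = 1; row 7 contributes 6 - 13 = -7.
print(step.tolist())
# [12.0, 14.0, 3.0, 1.0, 2.0, 7.0, 4.0, -7.0, -6.0, 18.0]

print(step.cumsum().astype(int).tolist())
# [12, 26, 29, 30, 32, 39, 43, 36, 30, 48]
```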
Commercial answered 20/6 at 0:27
I think this is a really good answer, but I have one question (not important): is there a specific reason you used transform? – Glyconeogenesis
@PandaKim Yes, I could use .diff(), but transform is just "muscle memory" :) – Commercial

Code

If you don't have too many unique values for id, I think you can use pivot + ffill + sum.

df["desired_output"] = (
    df.pivot(columns='id', values='value').ffill().sum(axis=1).astype('int')
)

df:

  id  value  desired_output
0  a     12              12
1  b     14              26
2  c      3              29
3  a     13              30
4  b     16              32
5  e      7              39
6  f      4              43
7  a      6              36
8  b     10              30
9  k     18              48
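[Editor's sketch of the intermediate frame, assuming the same toy data: pivot spreads each id into its own column, ffill carries each id's most recent value down the rows, and the row-wise sum is then exactly "the latest value seen so far for every id".]

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b", "c", "a", "b", "e", "f", "a", "b", "k"],
    "value": [12, 14, 3, 13, 16, 7, 4, 6, 10, 18],
})

# One column per id; a value appears only on the row where that id occurs.
wide = df.pivot(columns="id", values="value")
# Carry each id's most recent value forward; NaNs before an id's first
# occurrence remain NaN and are ignored by sum().
filled = wide.ffill()
# Row sum = total of the latest value per id seen so far.
out = filled.sum(axis=1).astype(int)

print(out.tolist())
# [12, 26, 29, 30, 32, 39, 43, 36, 30, 48]
```

Note that pivot creates one column per unique id, so memory grows with the number of distinct ids, which is why this works best when that number is small.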
Glyconeogenesis answered 20/6 at 0:20
out = df.groupby("id")["value"].diff().fillna(df["value"]).cumsum().astype(int)

df["out"] = out

Try this. (Note: diff() leaves NaN at each id's first occurrence, so the NaNs must be filled with the original values, not 0, or every first occurrence contributes nothing to the running total.)

Slavey answered 20/6 at 1:7
