Python Pandas dataframe subtract cumulative column
Asked Answered
S

4

7

I have some data that I am importing into a Pandas dataframe. This data is "cumulative" and indexed on a time series, see below:

                        Raw data
2016-11-23 10:00:00     48.6 
2016-11-23 11:00:00     158.7 
2016-11-23 12:00:00     377.8 
2016-11-23 13:00:00     591.7 
2016-11-23 14:00:00     748.5 
2016-11-23 15:00:00     848.2 

The data is updated daily, so the time series will move forward a day each day.

What I need to do is to take this dataframe and create a new column as shown below. The first row simply copies the data from the "Raw data" column. Then each subsequent row takes the data from the "Raw data" column, and subtracts the value that appeared before it, e.g. 158.7 - 48.6 = 110.1, 377.8 - 158.7 = 219.1, etc.

Does anyone know how I can achieve what is in the "Process data" column in Python/Pandas?

                    Raw data    Processed data
23/11/2016 10:00    48.6        48.6
23/11/2016 11:00    158.7       110.1
23/11/2016 12:00    377.8       219.1
23/11/2016 13:00    591.7       213.9
23/11/2016 14:00    748.5       156.8
23/11/2016 15:00    848.2       99.7
Specify answered 24/11, 2016 at 12:38 Comment(1)
Do you need the "Processed data" column to reset to the first value each day, or should it be one long stream?Scan
B
17

You can use substract by sub with shifted column:

Last fill NaN by first value in Raw data.

df['Processed data'] = df['Raw data'].sub(df['Raw data'].shift())
df['Processed data'].iloc[0] = df['Raw data'].iloc[0]
print (df)
                     Raw data  Processed data
2016-11-23 10:00:00      48.6            48.6
2016-11-23 11:00:00     158.7           110.1
2016-11-23 12:00:00     377.8           219.1
2016-11-23 13:00:00     591.7           213.9
2016-11-23 14:00:00     748.5           156.8
2016-11-23 15:00:00     848.2            99.7
Busch answered 24/11, 2016 at 12:42 Comment(0)
V
10

I think you can try this one, i found it easy : Will create a new column with subtracted value.

df['processed_data'] = df['Raw_data'].diff(1)
Vitkun answered 26/2, 2019 at 6:2 Comment(0)
F
0

With recent versions of Pandas, you can use diff and fillna:

df['Processed data'] = df['Raw data'].diff().fillna(df['Raw data'])
print(df)

# Output
                     Raw data  Processed data
2016-11-23 10:00:00      48.6            48.6
2016-11-23 11:00:00     158.7           110.1
2016-11-23 12:00:00     377.8           219.1
2016-11-23 13:00:00     591.7           213.9
2016-11-23 14:00:00     748.5           156.8
2016-11-23 15:00:00     848.2            99.7
Finely answered 8/6, 2023 at 7:22 Comment(0)
H
-1

You can do this with join in Pandas, which has the advantage of handling cases with more columns (which are not themselves unique).

Assume you have a DataFrame like

timestep                fid        cumul_value
2016-11-23 10:00:00     1          48.6 
2016-11-23 11:00:00     1          158.7 
2016-11-23 12:00:00     1          377.8 
2016-11-23 13:00:00     1          591.7 
2016-11-23 14:00:00     1          748.5 
2016-11-23 15:00:00     1          848.2 
2016-11-23 10:00:00     2          88.6 
2016-11-23 11:00:00     2          758.7 
...
2016-11-23 12:00:00     5          577.8 
2016-11-23 13:00:00     5          691.7 
2016-11-23 14:00:00     5          348.5 
2016-11-23 15:00:00     5          148.2 

where fid represents another parameter, over whose values cumul_value varies. You want to obtain a column value from the column cumul_value such that value(fid,timestep) = cumul_value(fid,timestep) - cumul_value(fid,timestep - 1) for each fid.

onestep = timedelta(hours=1)
df['prev_timestep'] = df['timestep'] - onestep
df_cumul = df[['id','fid','timestep','cumul_value']]
        .set_index(['timestep','fid'])
df_val = df.join(df_cumul,on=['prev_timestep','fid'],rsuffix='_prev')
df_val['value'] = df_val['cumul_value'] - df_val['cumul_value_prev']
df_val = df_val.drop(['prev_timestep','cumul_value_prev','cumul_value','id_prev'],axis=1)

Finish up by handling the first timestep which is likely a special case(call it t0)

df_t0 = df_cumul[df_cumul['timestep'] == t0]
df_val.loc[df_val.index.isin(df_t0.index),'value'] = df_t0['cumul_value']
Heretical answered 15/5, 2017 at 5:44 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.