Python Pandas Sum Values in Columns If date between 2 dates
Asked Answered
R

3

8

I have a dataframe df which can be created with this:

data={'id':[1,1,1,1,2,2,2,2],
      'date1':[datetime.date(2016,1,1),datetime.date(2016,1,2),datetime.date(2016,1,3),datetime.date(2016,1,4),
               datetime.date(2016,1,2),datetime.date(2016,1,4),datetime.date(2016,1,3),datetime.date(2016,1,1)],
      'date2':[datetime.date(2016,1,5),datetime.date(2016,1,3),datetime.date(2016,1,5),datetime.date(2016,1,5),
               datetime.date(2016,1,4),datetime.date(2016,1,5),datetime.date(2016,1,4),datetime.date(2016,1,1)],
      'score1':[5,7,3,2,9,3,8,3],
      'score2':[1,3,0,5,2,20,7,7]}
df=pd.DataFrame.from_dict(data)

And looks like this:
   id       date1       date2  score1  score2
0   1  2016-01-01  2016-01-05       5       1
1   1  2016-01-02  2016-01-03       7       3
2   1  2016-01-03  2016-01-05       3       0
3   1  2016-01-04  2016-01-05       2       5
4   2  2016-01-02  2016-01-04       9       2
5   2  2016-01-04  2016-01-05       3      20
6   2  2016-01-03  2016-01-04       8       7
7   2  2016-01-01  2016-01-01       3       7

What I need to do is create a column for each of score1 and score2, which creates two columns which SUM the values of score1 and score2 respectively, based on whether the usedate is between date1 and date2. usedate is created by getting all dates between and including the date1 minimum and the date2 maximum. I used this to create the date range:

drange=pd.date_range(df.date1.min(),df.date2.max())    

The resulting dataframe newdf should look like:

     usedate  score1sum  score2sum
0 2016-01-01          8          8
1 2016-01-02         21          6
2 2016-01-03         32         13
3 2016-01-04         30         35
4 2016-01-05         13         26

For clarification, on usedate 2016-01-01, score1sum is 8, which is calculated by looking at the rows in df where 2016-01-01 is between and including date1 and date2, which sum row0(5) and row8(3). On usedate 2016-01-04, score2sum is 35, which is calculated by looking at the rows in df where 2016-01-04 is between and including date1 and date2, which sum row0(1), row3(0), row4(5), row5(2), row6(20), row7(7).

Maybe some kind of groupby, or melt then groupby?

Revanche answered 4/1, 2018 at 21:28 Comment(3)
Do you count rows from 0 or 1? There is a confusion in your second example. And in the first example I don't see that row8 is 3. Back to second example, how come 2016-01-04 is between 2017-05-28 and 2017-09-22, and between 2015-11-01 and 2015-11-09?Schenck
Also, your minimum date (df.date1.min()) is 2015-11-01, how can your usedate column start from 2016-01-01?Schenck
@Fatih Akici, much apologies. I had recreated the sample dataframe with fewer dates, and forgot to change what I originally had. I have updated the question with the correct sample.Revanche
A
6

You can use apply with lambda function:

df['date1'] = pd.to_datetime(df['date1'])

df['date2'] = pd.to_datetime(df['date2'])

df1 = pd.DataFrame(index=pd.date_range(df.date1.min(), df.date2.max()), columns = ['score1sum', 'score2sum'])

df1[['score1sum','score2sum']] = df1.apply(lambda x: df.loc[(df.date1 <= x.name) & 
                                                            (x.name <= df.date2),
                                                            ['score1','score2']].sum(), axis=1)

df1.rename_axis('usedate').reset_index()

Output:

     usedate  score1sum  score2sum
0 2016-01-01          8          8
1 2016-01-02         21          6
2 2016-01-03         32         13
3 2016-01-04         30         35
4 2016-01-05         13         26
Abracadabra answered 5/1, 2018 at 4:1 Comment(6)
what is the x.name above referring to?Revanche
x.name is refering to the column the apply is working on. Apply takes each column of a dataframe at a time. Determine which column are you are working you can use x.name to that series name.Abracadabra
im confused, can you please apply the actual column name you used to get the output? I assumed your code would apply to both score1 and score2.Revanche
I am sorry.... since we are using apply with axis=1, x.name is referring the row index of the dataframe we are working on. So, for row1 x.name will be the first date in the index. Apply works on a column or a row at time time depending on the axis. So, each of those are passed in as 'pd.Series' with the index as the name.Abracadabra
No. x.name in this case is referring to the usedate we are building in the dataframe. So when you apply with axis=1. Each row is passed in as a series, the series index are the column headers from the dataframe and the name of the series is the row index of that dataframe.Abracadabra
this works, choosing this but want to see if the below is faster, waiting on @Peter Leimbigler to respond to using transform with a series...Revanche
D
1

Method 1: list comprehensions

This is inelegant, but hey, it works! (EDIT: added a second method below.)

# Convert datetime.date to pandas timestamps for easier comparisons
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# solution
newdf = pd.DataFrame(data=drange, columns=['usedate'])
# for each usedate ud, get all df rows whose dates contain ud,
# then sum the scores of these rows
newdf['score1sum'] = [df[(df['date1'] <= ud) & (df['date2'] >= ud)]['score1'].sum() for ud in drange]
newdf['score2sum'] = [df[(df['date1'] <= ud) & (df['date2'] >= ud)]['score2'].sum() for ud in drange]

# output
newdf
     usedate  score1sum  score2sum
  2016-01-01          8          8
  2016-01-02         21          6
  2016-01-03         32         13
  2016-01-04         30         35
  2016-01-05         13         26

Method 2: a helper function with transform (or apply)

newdf = pd.DataFrame(data=drange, columns=['usedate'])

def sum_scores(d):
    return df[(df['date1'] <= d) & (df['date2'] >= d)][['score1', 'score2']].sum()

# apply works here too, and is about equally fast in my testing
newdf[['score1sum', 'score2sum']] = newdf['usedate'].transform(sum_scores)

# newdf is same to above

Timings are comparable

# Jupyter timeit cell magic
%%timeit 
newdf['score1sum'] = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score1'].sum() for d in drange]
newdf['score1sum'] = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score2'].sum() for d in drange]

100 loops, best of 3: 10.4 ms per loop

# Jupyter timeit line magic
%timeit newdf[['score1sum', 'score2sum']] = newdf['usedate'].transform(sum_scores) 

100 loops, best of 3: 8.51 ms per loop
Dendrology answered 5/1, 2018 at 3:44 Comment(6)
when I run this I get'Series Object has no attribute 'transform'. newdf['usedate'] is a series. How did you get the transform to work?Revanche
@Revanche I think you need to upgrade your pandas? What version of pandas are running?Abracadabra
Pandas version 19.2Revanche
@clg4, that's it: DataFrame.transform was introduced in 0.20.0.Dendrology
in this answer, you have date=drange what is drange?Tara
From OP's question: drange=pd.date_range(df.date1.min(),df.date2.max())Dendrology
H
0

conditional_join from pyjanitor may be helpful in the abstraction/convenience:

# pip install pyjanitor
import pandas as pd
import janitor as jn

drange = pd.DataFrame(drange, columns=['dates'])
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

(drange.conditional_join(df, 
                         ('dates', 'date1', '>='), 
                         ('dates', 'date2', '<='))
.droplevel(0, 1)
.select_columns('dates', 'score*')
.groupby('dates')
.sum()
.add_suffix('num')
) 
            score1num  score2num
dates                           
2016-01-01          8          8
2016-01-02         21          6
2016-01-03         32         13
2016-01-04         30         35
2016-01-05         13         26

Heterogamy answered 17/10, 2021 at 11:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.