Python Pandas Sum Values in Columns If date between 2 dates

Asked 4/1, 2018 at 21:28 Answered 17/10, 2021 at 11:11

Solved python pandas dataframe pandas-groupby melt

I have a dataframe df which can be created with this:

data={'id':[1,1,1,1,2,2,2,2],
      'date1':[datetime.date(2016,1,1),datetime.date(2016,1,2),datetime.date(2016,1,3),datetime.date(2016,1,4),
               datetime.date(2016,1,2),datetime.date(2016,1,4),datetime.date(2016,1,3),datetime.date(2016,1,1)],
      'date2':[datetime.date(2016,1,5),datetime.date(2016,1,3),datetime.date(2016,1,5),datetime.date(2016,1,5),
               datetime.date(2016,1,4),datetime.date(2016,1,5),datetime.date(2016,1,4),datetime.date(2016,1,1)],
      'score1':[5,7,3,2,9,3,8,3],
      'score2':[1,3,0,5,2,20,7,7]}
df=pd.DataFrame.from_dict(data)

And looks like this:
   id       date1       date2  score1  score2
0   1  2016-01-01  2016-01-05       5       1
1   1  2016-01-02  2016-01-03       7       3
2   1  2016-01-03  2016-01-05       3       0
3   1  2016-01-04  2016-01-05       2       5
4   2  2016-01-02  2016-01-04       9       2
5   2  2016-01-04  2016-01-05       3      20
6   2  2016-01-03  2016-01-04       8       7
7   2  2016-01-01  2016-01-01       3       7

What I need to do is create a column for each of score1 and score2, which creates two columns which SUM the values of score1 and score2 respectively, based on whether the usedate is between date1 and date2. usedate is created by getting all dates between and including the date1 minimum and the date2 maximum. I used this to create the date range:

drange=pd.date_range(df.date1.min(),df.date2.max())

The resulting dataframe newdf should look like:

     usedate  score1sum  score2sum
0 2016-01-01          8          8
1 2016-01-02         21          6
2 2016-01-03         32         13
3 2016-01-04         30         35
4 2016-01-05         13         26

For clarification, on usedate 2016-01-01, score1sum is 8, which is calculated by looking at the rows in df where 2016-01-01 is between and including date1 and date2, which sum row0(5) and row8(3). On usedate 2016-01-04, score2sum is 35, which is calculated by looking at the rows in df where 2016-01-04 is between and including date1 and date2, which sum row0(1), row3(0), row4(5), row5(2), row6(20), row7(7).

Maybe some kind of groupby, or melt then groupby?

Revanche answered 4/1, 2018 at 21:28 Comment(3)

Do you count rows from 0 or 1? There is a confusion in your second example. And in the first example I don't see that row8 is 3. Back to second example, how come 2016-01-04 is between 2017-05-28 and 2017-09-22, and between 2015-11-01 and 2015-11-09? – Schenck 4/1, 2018 at 21:42

Also, your minimum date (df.date1.min()) is 2015-11-01, how can your usedate column start from 2016-01-01? – Schenck 4/1, 2018 at 21:54

@Fatih Akici, much apologies. I had recreated the sample dataframe with fewer dates, and forgot to change what I originally had. I have updated the question with the correct sample. – Revanche 4/1, 2018 at 22:41

You can use apply with lambda function:

df['date1'] = pd.to_datetime(df['date1'])

df['date2'] = pd.to_datetime(df['date2'])

df1 = pd.DataFrame(index=pd.date_range(df.date1.min(), df.date2.max()), columns = ['score1sum', 'score2sum'])

df1[['score1sum','score2sum']] = df1.apply(lambda x: df.loc[(df.date1 <= x.name) & 
                                                            (x.name <= df.date2),
                                                            ['score1','score2']].sum(), axis=1)

df1.rename_axis('usedate').reset_index()

Output:

     usedate  score1sum  score2sum
0 2016-01-01          8          8
1 2016-01-02         21          6
2 2016-01-03         32         13
3 2016-01-04         30         35
4 2016-01-05         13         26

Abracadabra answered 5/1, 2018 at 4:1 Comment(6)

what is the x.name above referring to? – Revanche 15/1, 2018 at 17:32

x.name is refering to the column the apply is working on. Apply takes each column of a dataframe at a time. Determine which column are you are working you can use x.name to that series name. – Abracadabra 15/1, 2018 at 17:35

im confused, can you please apply the actual column name you used to get the output? I assumed your code would apply to both score1 and score2. – Revanche 15/1, 2018 at 17:39

I am sorry.... since we are using apply with axis=1, x.name is referring the row index of the dataframe we are working on. So, for row1 x.name will be the first date in the index. Apply works on a column or a row at time time depending on the axis. So, each of those are passed in as 'pd.Series' with the index as the name. – Abracadabra 15/1, 2018 at 17:42

No. x.name in this case is referring to the usedate we are building in the dataframe. So when you apply with axis=1. Each row is passed in as a series, the series index are the column headers from the dataframe and the name of the series is the row index of that dataframe. – Abracadabra 15/1, 2018 at 17:50

this works, choosing this but want to see if the below is faster, waiting on @Peter Leimbigler to respond to using transform with a series... – Revanche 15/1, 2018 at 18:30

Method 1: list comprehensions

This is inelegant, but hey, it works! (EDIT: added a second method below.)

# Convert datetime.date to pandas timestamps for easier comparisons
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# solution
newdf = pd.DataFrame(data=drange, columns=['usedate'])
# for each usedate ud, get all df rows whose dates contain ud,
# then sum the scores of these rows
newdf['score1sum'] = [df[(df['date1'] <= ud) & (df['date2'] >= ud)]['score1'].sum() for ud in drange]
newdf['score2sum'] = [df[(df['date1'] <= ud) & (df['date2'] >= ud)]['score2'].sum() for ud in drange]

# output
newdf
     usedate  score1sum  score2sum
  2016-01-01          8          8
  2016-01-02         21          6
  2016-01-03         32         13
  2016-01-04         30         35
  2016-01-05         13         26

Method 2: a helper function with `transform` (or `apply`)

newdf = pd.DataFrame(data=drange, columns=['usedate'])

def sum_scores(d):
    return df[(df['date1'] <= d) & (df['date2'] >= d)][['score1', 'score2']].sum()

# apply works here too, and is about equally fast in my testing
newdf[['score1sum', 'score2sum']] = newdf['usedate'].transform(sum_scores)

# newdf is same to above

Timings are comparable

# Jupyter timeit cell magic
%%timeit 
newdf['score1sum'] = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score1'].sum() for d in drange]
newdf['score1sum'] = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score2'].sum() for d in drange]

100 loops, best of 3: 10.4 ms per loop

# Jupyter timeit line magic
%timeit newdf[['score1sum', 'score2sum']] = newdf['usedate'].transform(sum_scores) 

100 loops, best of 3: 8.51 ms per loop

Dendrology answered 5/1, 2018 at 3:44 Comment(6)

when I run this I get'Series Object has no attribute 'transform'. newdf['usedate'] is a series. How did you get the transform to work? – Revanche 15/1, 2018 at 17:53

@Revanche I think you need to upgrade your pandas? What version of pandas are running? – Abracadabra 15/1, 2018 at 18:38

Pandas version 19.2 – Revanche 15/1, 2018 at 21:57

@clg4, that's it: DataFrame.transform was introduced in 0.20.0. – Dendrology 15/1, 2018 at 23:16

in this answer, you have date=drange what is drange? – Tara 8/10, 2020 at 4:40

From OP's question: drange=pd.date_range(df.date1.min(),df.date2.max()) – Dendrology 8/10, 2020 at 15:48

conditional_join from pyjanitor may be helpful in the abstraction/convenience:

# pip install pyjanitor
import pandas as pd
import janitor as jn

drange = pd.DataFrame(drange, columns=['dates'])
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

(drange.conditional_join(df, 
                         ('dates', 'date1', '>='), 
                         ('dates', 'date2', '<='))
.droplevel(0, 1)
.select_columns('dates', 'score*')
.groupby('dates')
.sum()
.add_suffix('num')
) 
            score1num  score2num
dates                           
2016-01-01          8          8
2016-01-02         21          6
2016-01-03         32         13
2016-01-04         30         35
2016-01-05         13         26

Heterogamy answered 17/10, 2021 at 11:11 Comment(0)

Method 1: list comprehensions

Method 2: a helper function with `transform` (or `apply`)

Timings are comparable

Recommended topics

Hot tags

Method 1: list comprehensions

Method 2: a helper function with transform (or apply)

Timings are comparable

Recommended topics

Hot tags

Method 2: a helper function with `transform` (or `apply`)