pandas: function equivalent to SQL's datediff()?
Asked Answered
C

3

5

Is there an equivalent to SQL's datediff function in Python's pandas? The answer to this question: Add column with number of days between dates in DataFrame pandas explains how to calculate the difference in days. For example:

>>> (pd.to_datetime('15-10-2010') - pd.to_datetime('15-07-2010')) / pd.offsets.Day(1)
92.0

However, I have two questions:

  1. Is there a way to calculate the difference in months? I can approximate dividing the result above by 30, by 31, but I was wondering if there is some built-in function that does this automatically.
  2. what is the syntax of pd.offsets? I tried dividing by pd.offsets.Month(1) and it doesn't work. I looked up the documentation here (which is atrocious, like all of Python's documentation!): http://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects but Day() is not there, so I'm confused
Confute answered 6/5, 2016 at 17:37 Comment(2)
Also, I cannot seem to convert the result into an integer. Any ideas?Confute
Mmm, I found here #22133025 how to convert from timedelta to float: divide by np.timedelta64(1,'D'). Why, and where it is documented, are separate questions!!!Confute
P
4

UPDATE:

def months_between(d1, d2):
    dd1 = min(d1, d2)
    dd2 = max(d1, d2)
    return (dd2.year - dd1.year)*12 + dd2.month - dd1.month

In [125]: months_between(pd.to_datetime('2015-01-02 12:13:14'), pd.to_datetime('2012-03-02 12:13:14'))
Out[125]: 34

OLD answer:

In [40]: (pd.to_datetime('15-10-2010') - pd.to_datetime('15-07-2010')).days
Out[40]: 92

you can also do this for months:

In [48]: pd.to_datetime('15-10-2010').month - pd.to_datetime('15-07-2010').month
Out[48]: 3
Pepe answered 6/5, 2016 at 17:39 Comment(7)
Super! Thanks a lot! Am I super clumsy or is pandas documentation really atrocious?Confute
Actually, no: this doesn't work with different years, eg the following still returns 3: pd.to_datetime('15-10-2015').month - pd.to_datetime('15-07-2010').monthConfute
This would be because you are only using the .month property, not really taking into consideration that there are several years in between the two dates.Pool
@maxU so does Matlab's, yet its documentation is absolutely superb. I never had to spend hours and hours juist looking for answers to basic questions; time is money, and if Matlab were as good as Pandas in handling non-numerical data the cost of the licence would have paid for itself alreadyConfute
@Thanos, yes, I gathered as much :)Confute
@Pythonistaanonymous, sorry, i couldn't find any convenient way how to convert it to timedelta in months. Maybe just count delta in years_delta*12 + months_delta ?Pepe
@Pythonistaanonymous, i've updated my answer - please checkPepe
P
3

If you look around a little, it seems that months is not possible to get out of a TimeDelta:

In [193]: date_1 = pd.to_datetime('2015-01-02 12:13:14')

In [194]: date_2 = pd.to_datetime('2012-03-02 12:13:14')

In [195]: date_1 - date_2
Out[195]: Timedelta('1036 days 00:00:00')

In [199]: td_1.
td_1.asm8            td_1.days            td_1.freq            td_1.microseconds    td_1.resolution      td_1.to_pytimedelta  td_1.value           
td_1.ceil            td_1.delta           td_1.is_populated    td_1.min             td_1.round           td_1.to_timedelta64  td_1.view            
td_1.components      td_1.floor           td_1.max             td_1.nanoseconds     td_1.seconds         td_1.total_seconds

In [199]: td_1.components
Out[199]: Components(days=1036, hours=0, minutes=0, seconds=0, milliseconds=0, microseconds=0, nanoseconds=0)

Additionally, Components are not offering different denominations of the same value seemingly, but

In [213]: td_1.components.days
Out[213]: 1036

In [214]: td_1.components.hours
Out[214]: 0

Ultimately, it seems that what you have been doing until now seems like the "best" solution:

In [214]: td_1.components.days/30
Out[214]: 34.53333333333333

In [215]: np.round(td_1.components.days/30)
Out[215]: 35.0

In [216]: np.floor(td_1.components.days/30)
Out[216]: 34.0

Not the great news really, but a solution in any case.

As to comparing the documentation that Matlab comes with to this of pandas, you are right. However, if you were to compare the price tag of the two as well maybe some questions are answered.. (?)

Pool answered 6/5, 2016 at 18:9 Comment(1)
Sure, but my point is that, in a business environment, time is money, and the 'free'price of a poorly documented product is a false economy. I have just wasted about an hour trying to figure out how to subtract dates. If there existed a commercial software with roughly the functionalities of pandas and the quality of documentation of Matlab, i'd buy it in a heartbeatConfute
K
0

There is also pd.Period, which you can use to get the difference for any time period.

In [1]: d1 = pd.Period('2022-08-31', 'M')
        d1
Out[1]: Period('2022-08', 'M')

In [2]: d2 = pd.Period('2020-03-25', 'M')
        d2
Out[2]: Period('2020-03', 'M')

In [3]: diff = d1 - d2
        diff
Out[3]: <29 * MonthEnds>

In [4]: diff.n
Out[4]: 29

Here's a generic function that you can use with apply, to turn any date column into your desired period:

def keep_period(x, period='M'):

    if pd.isna(x):
        return x
    else:
        return pd.to_datetime(x).to_period(period)

and to call it:

df['date_column_period'] = df['date_column'].apply(keep_period, args=('Y', ))
Karren answered 8/9, 2022 at 7:30 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.