Merge pandas dataframes where one value is between two others [duplicate]

I need to merge two pandas dataframes on an identifier and a condition where a date in one dataframe is between two dates in the other dataframe.

Dataframe A has a date ("fdate") and an ID ("cusip"):

[screenshot of dataframe A: columns fdate and cusip]

I need to merge this with this dataframe B:

[screenshot of dataframe B: columns ncusip, namedt, nameenddt]

on A.cusip==B.ncusip and A.fdate is between B.namedt and B.nameenddt.

In SQL this would be trivial, but the only way I can see how to do this in pandas is to first merge unconditionally on the identifier, and then filter on the date condition:

df = pd.merge(A, B, how='inner', left_on='cusip', right_on='ncusip')
df = df[(df['fdate']>=df['namedt']) & (df['fdate']<=df['nameenddt'])]

Is this really the best way to do this? It seems that it would be much better if one could filter within the merge so as to avoid having a potentially very large dataframe after the merge but before the filter has completed.
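For reference, one lower-memory alternative is a backward `pd.merge_asof` (recent pandas), which pairs each `fdate` with the latest `namedt` at or before it within the id, so only the upper bound needs filtering afterwards. A sketch with made-up data (the column names follow the question; the values are hypothetical, and this assumes the date ranges for a given id do not overlap):

```python
import pandas as pd

# Hypothetical data using the question's column names
A = pd.DataFrame({'cusip': ['X1', 'X1', 'X2'],
                  'fdate': pd.to_datetime(['2010-06-01', '2012-06-01', '2011-01-15'])})
B = pd.DataFrame({'ncusip': ['X1', 'X1', 'X2'],
                  'namedt': pd.to_datetime(['2010-01-01', '2011-01-01', '2010-01-01']),
                  'nameenddt': pd.to_datetime(['2010-12-31', '2013-12-31', '2012-12-31']),
                  'permno': [10, 11, 20]})

# merge_asof requires both frames sorted on the asof keys; it matches each
# fdate to the latest namedt <= fdate within the matching id.
out = pd.merge_asof(A.sort_values('fdate'), B.sort_values('namedt'),
                    left_on='fdate', right_on='namedt',
                    left_by='cusip', right_by='ncusip')
# Only the upper bound of the interval still needs checking
out = out[out['fdate'] <= out['nameenddt']]
```

Rows whose `fdate` falls after `nameenddt` (or that have no match at all) are dropped by the final filter, and the intermediate frame never grows beyond `len(A)` rows.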

Rolfrolfe answered 3/6, 2015 at 18:33 Comment(10)
To be honest, that would be the way I'd do it; to the best of my knowledge, the merge params allow only exact matches on index or columns, not a special criterion like you want.Pirali
@EdChum, do you think it would make sense to put a feature request for this on GitHub? Do you think there some deep reason why this isn't already a feature?Rolfrolfe
To some extent one should not expect SQL-like functionality when it comes to dfs, especially as you can filter the dfs before/after the merge. You could add a request; I'm not a dev, but they are very responsive.Pirali
If it hasn't already been done, a feature request at github sounds like a good idea to me. This sort of question comes up here fairly frequently and I've never seen a really good answer -- good as in easy to do with pandas.Miserly
I guess the problem here is to define a sensible API for this. Currently there are named args, and if you start to allow things like less than, greater than, etc., it could become messy. Otherwise you could allow a user string which could be evaluated, but even that could be cumbersome, as the order may matter when it comes to the comparisons. Those are my thoughts on this.Pirali
I think the problem is always about points in time and timespans and supporting operations with these kinds of objects on join would allow for lots of new solutions.Wordsworth
For completeness, I opened an issue on GitHub: github.com/pydata/pandas/issues/10309Rolfrolfe
I have the same issue and in the past have used the foverlaps function in R to accomplish this. Still have yet to find a solution. See this question for more info: #40247595Neanderthal
I think the code with the unconditional merge then filter has a bug in that it would drop observations which merge to the ID, but not also to the date intervalAlthaalthea
pandas.pydata.org/pandas-docs/stable/reference/api/… You can do a backward merge and forward mergeFantasist

As you say, this is pretty easy in SQL, so why not do it in SQL?

import pandas as pd
import sqlite3
from datetime import datetime

#We'll use firelynx's tables:
presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id":[43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})
#Make the db in memory
conn = sqlite3.connect(':memory:')
#write the tables
terms.to_sql('terms', conn, index=False)
presidents.to_sql('presidents', conn, index=False)
war_declarations.to_sql('wars', conn, index=False)

qry = '''
    select  
        start_date PresTermStart,
        end_date PresTermEnd,
        wars.date WarStart,
        presidents.name Pres
    from
        terms join wars on
        date between start_date and end_date join presidents on
        terms.president_id = presidents.president_id
    '''
df = pd.read_sql_query(qry, conn)

df:

         PresTermStart          PresTermEnd             WarStart  Pres
0  2001-01-31 00:00:00  2005-01-31 00:00:00  2001-09-14 00:00:00  Bush
1  2001-01-31 00:00:00  2005-01-31 00:00:00  2003-03-03 00:00:00  Bush
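On larger tables, the round trip through SQLite pays off more if the columns used in the range join are indexed. A minimal, self-contained sketch (table and column names are illustrative; dates are stored as ISO strings so SQLite's lexicographic `BETWEEN` compares them correctly):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
terms = pd.DataFrame({'president_id': [43, 44],
                      'start_date': ['2001-01-20', '2009-01-20'],
                      'end_date': ['2009-01-20', '2017-01-20']})
terms.to_sql('terms', conn, index=False)

# Index the range columns so SQLite can seek instead of scanning every row
conn.execute('CREATE INDEX idx_terms_dates ON terms (start_date, end_date)')

df = pd.read_sql_query(
    "select president_id from terms "
    "where '2003-03-03' between start_date and end_date", conn)
```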
Lubalubba answered 14/3, 2017 at 20:49 Comment(8)
Would you mind providing some benchmarking for how this solution performs vs. other solutions? This answer feels like it does not answer the question at all, more like it circumvents it, unless this actually is a performant solution.Wordsworth
The right tool for the job.Stockstill
I used this idea today for a related problem and it worked well. Worth noting that for a large data set, you will likely get a significant speed increase by creating an index for any column(s) used in the join condition.Skerl
I just used Google Advanced search for so long to re-find this answer. this solves countless other issues and I'll be using it across many projects. What a great answer.Emphysema
I found this solution to be >10 times faster than the one here: https://mcmap.net/q/218746/-how-to-join-two-dataframes-for-which-column-values-are-within-a-certain-rangeChavis
I don't think this answers the question, as this is asking the server to handle the join, whereas the question is about how to get Python pandas to do it. This matters because there are a gazillion use cases for this where the data are in two different locations.Zabrina
I'm happy with this answer. Just checking (almost 2 years later) if this has been integrated into pandas? Seems like a good idea.Meuser
The solution here works, but it's a little painful because of some non-standard types in my dataframe, which require some wrappers to get this working.Meuser

You should be able to do this now using the pandasql package:

import pandasql as ps

sqlcode = '''
select A.cusip
from A
inner join B on A.cusip=B.ncusip
where A.fdate >= B.namedt and A.fdate <= B.nameenddt
group by A.cusip
'''

newdf = ps.sqldf(sqlcode,locals())

I think the answer from @ChuHo is good. I believe pandasql is doing the same for you. I haven't benchmarked the two, but it is easier to read.

Cogency answered 4/8, 2017 at 15:32 Comment(2)
why not include the where criteria in the ON criteria: FROM A INNER JOIN B ON A.cusip = B.ncusip AND A.fdate>= B.namedt AND fdate <= B.nameenddt GROUP BY A.cusip ?Ibanez
@WaelHussein With an inner join it doesn't matter; moving the condition into the ON clause only changes the result for an outer join.Cogency

There is no pandamic way of doing this at the moment.

This answer used to be about tackling the problem with polymorphism, which turned out to be a very bad idea.

Then the numpy.piecewise function appeared in another answer, but with little explanation, so I thought I would clarify how this function can be used.

Numpy way with piecewise (Memory heavy)

The np.piecewise function can be used to emulate the behavior of a custom join. There is a lot of overhead involved and it is not very efficient per se, but it does the job.

Producing conditions for joining

import pandas as pd
import numpy as np
from datetime import datetime


presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id":[43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})

start_end_date_tuples = zip(terms.start_date.values, terms.end_date.values)
conditions = [(war_declarations.date.values >= start_date) &
              (war_declarations.date.values <= end_date) for start_date, end_date in start_end_date_tuples]

> conditions
[array([ True,  True], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool),
 array([False, False], dtype=bool)]

This is a list of arrays, where each array tells us whether the term time span matched each of the two war declarations. With larger datasets the conditions can explode in memory: the total number of boolean values is the length of the left df times the length of the right df.

The piecewise "magic"

Now piecewise will take the president_id from the terms and place it in the war_declarations dataframe for each of the corresponding wars.

war_declarations['president_id'] = np.piecewise(np.zeros(len(war_declarations)),
                                                conditions,
                                                terms.president_id.values)
    date        name                president_id
0   2001-09-14  War in Afghanistan          43.0
1   2003-03-03  Iraq War                    43.0

Now, to finish this example, we just need a regular merge to bring in the presidents' names.

war_declarations.merge(presidents, on="president_id", suffixes=["_war", "_president"])

    date        name_war            president_id    name_president
0   2001-09-14  War in Afghanistan          43.0    Bush
1   2003-03-03  Iraq War                    43.0    Bush

Polymorphism (does not work)

I wanted to share my research efforts, so even if this does not solve the problem, I hope it will be allowed to live on here as a useful reply at least. Since it is hard to spot the error, someone else may try this and think they have a working solution, while in fact, they don't.

The only other way I could figure out is to create two new classes, one PointInTime and one Timespan

Both should have __eq__ methods where they return true if a PointInTime is compared to a Timespan which contains it.

After that you can fill your DataFrame with these objects, and join on the columns they live in.

Something like this:

class PointInTime(object):

    def __init__(self, year, month, day):
        self.dt = datetime(year, month, day)

    def __eq__(self, other):
        return other.start_date < self.dt < other.end_date

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "{}-{}-{}".format(self.dt.year, self.dt.month, self.dt.day)

class Timespan(object):
    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date

    def __eq__(self, other):
        return self.start_date < other.dt < self.end_date

    def __ne__(self, other):
        return not self.__eq__(other)

    def __repr__(self):
        return "{}-{}-{} -> {}-{}-{}".format(self.start_date.year, self.start_date.month, self.start_date.day,
                                             self.end_date.year, self.end_date.month, self.end_date.day)

Important note: I do not subclass datetime because pandas will consider the dtype of the column of datetime objects to be a datetime dtype, and since the timespan is not, pandas silently refuses to merge on them.

If we instantiate two objects of these classes, they can now be compared:

pit = PointInTime(2015,1,1)
ts = Timespan(datetime(2014,1,1), datetime(2015,2,2))
pit == ts
True

We can also fill two DataFrames with these objects:

df = pd.DataFrame({"pit":[PointInTime(2015,1,1), PointInTime(2015,2,2), PointInTime(2015,3,3)]})

df2 = pd.DataFrame({"ts":[Timespan(datetime(2015,2,1), datetime(2015,2,5)), Timespan(datetime(2015,2,1), datetime(2015,4,1))]})

And then the merging kind of works:

pd.merge(left=df, left_on='pit', right=df2, right_on='ts')

        pit                    ts
0  2015-2-2  2015-2-1 -> 2015-2-5
1  2015-2-2  2015-2-1 -> 2015-4-1

But only kind of.

PointInTime(2015,3,3) should also have been included in this join on Timespan(datetime(2015,2,1), datetime(2015,4,1))

But it is not.

I figure pandas compares PointInTime(2015,3,3) to PointInTime(2015,2,2) and makes the assumption that since they are not equal, PointInTime(2015,3,3) cannot be equal to Timespan(datetime(2015,2,1), datetime(2015,4,1)), since this timespan was equal to PointInTime(2015,2,2)

Sort of like this:

Rose == Flower
Lilly != Rose

Therefore:

Lilly != Flower

Edit:

I tried to make all PointInTime equal to each other, this changed the behaviour of the join to include the 2015-3-3, but the 2015-2-2 was only included for the Timespan 2015-2-1 -> 2015-2-5, so this strengthens my above hypothesis.

If anyone has any other ideas, please comment and I can try it.

Wordsworth answered 3/6, 2015 at 21:16 Comment(4)
__neq__ doesn't work, you need __ne__Toponym
Reading this in 2020, "pandamic" is so reminiscent of "pandemic". Nice answer nevertheless.Spelaean
A note for whoever is using numpy piecewise: if none of the conditions is satisfied, piecewise inputs a default value of 0. This could cause misleading behaviours.Selma
Thanks a lot for the detailed answer and the contribution of the polymorph ideaImperialism

A pandas solution would be great if implemented similarly to foverlaps() from the data.table package in R. So far I've found numpy's piecewise() to be efficient. I've provided the code based on an earlier discussion, Merging dataframes based on date range.

import numpy as np

A['permno'] = np.piecewise(np.zeros(len(A)),
                           [(A['cusip'].values == id) &
                            (A['fdate'].values >= start) &
                            (A['fdate'].values <= end)
                            for id, start, end in zip(B['ncusip'].values,
                                                      B['namedt'].values,
                                                      B['nameenddt'].values)],
                           B['permno'].values).astype(int)
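The snippet references the question's A and B, which are not defined here. A self-contained sketch with hypothetical data (two ids, one name-range each, following the question's column names):

```python
import numpy as np
import pandas as pd

# Hypothetical frames mirroring the question's A and B
A = pd.DataFrame({'cusip': ['X1', 'X2'],
                  'fdate': pd.to_datetime(['2010-06-01', '2011-01-15'])})
B = pd.DataFrame({'ncusip': ['X1', 'X2'],
                  'namedt': pd.to_datetime(['2010-01-01', '2010-01-01']),
                  'nameenddt': pd.to_datetime(['2010-12-31', '2012-12-31']),
                  'permno': [10, 20]})

# One boolean array per row of B, marking which rows of A fall
# inside that row's id and date range
conditions = [(A['cusip'].values == i) &
              (A['fdate'].values >= s) &
              (A['fdate'].values <= e)
              for i, s, e in zip(B['ncusip'], B['namedt'], B['nameenddt'])]

# piecewise writes B's permno wherever the matching condition is True;
# rows of A that match nothing keep the default value 0
A['permno'] = np.piecewise(np.zeros(len(A)), conditions,
                           B['permno'].values).astype(int)
```

Note the default-value caveat from the comments below: an unmatched row of A silently gets permno 0, so it may be worth checking for zeros afterwards.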
Fairman answered 13/2, 2017 at 7:20 Comment(0)
