The answer by firelynx here on StackOverflow suggests that polymorphism does not work for this kind of join, and after extensive testing I have to agree. However, combining the polymorphism idea with the numpy broadcasting solution of piRSquared does work!
The only problem is that, under the hood, the numpy broadcasting effectively performs a cross join and then keeps only the pairs that compare equal, which costs O(n1*n2) memory and O(n1*n2) time. Probably someone can make this more efficient in a generic sense.
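To make that cost concrete, here is a minimal sketch with plain integers standing in for the points and intervals (the arrays `a` and `b` are made up for illustration and are not the data used in this answer). The comparison first builds a full n1-by-n2 boolean mask, and only then does `np.where` extract the matching index pairs.

```python
import numpy as np

# Illustrative stand-ins, not the data from this answer.
a = np.array([1, 2, 3, 4])   # n1 = 4
b = np.array([2, 4, 2])      # n2 = 3

# Broadcasting compares every element of a with every element of b,
# so the intermediate mask always has n1 * n2 entries.
mask = a[:, None] == b
print(mask.shape)

# Row and column indices of the True cells, i.e. the matching pairs.
i, j = np.where(mask)
pairs = list(zip(i.tolist(), j.tolist()))
print(pairs)
```

The mask is materialised in full even when only a handful of pairs match, which is where the quadratic memory use comes from.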
The reason I post here is that the question answered by firelynx was closed as a duplicate of this one, which I disagree with: this question and its answers only cover one point belonging to multiple intervals, not multiple points belonging to multiple intervals. The solution I propose below does handle these n-m relations.
Basically, create the following two classes, PointInTime and Timespan, for the polymorphism.
from datetime import datetime

class PointInTime(object):
    doPrint = True

    def __init__(self, year, month, day):
        self.dt = datetime(year, month, day)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            r = (self.dt == other.dt)
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (equals) gives {r}')
            return (r)
        elif isinstance(other, Timespan):
            r = (other.start_date < self.dt < other.end_date)
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (Timespan in PointInTime) gives {r}')
            return (r)
        else:
            if self.doPrint:
                print(f'Not implemented... (PointInTime)')
            return NotImplemented

    def __repr__(self):
        return "{}-{}-{}".format(self.dt.year, self.dt.month, self.dt.day)

class Timespan(object):
    doPrint = True

    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            r = ((self.start_date == other.start_date) and (self.end_date == other.end_date))
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (equals) gives {r}')
            return (r)
        elif isinstance(other, PointInTime):
            r = self.start_date < other.dt < self.end_date
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (PointInTime in Timespan) gives {r}')
            return (r)
        else:
            if self.doPrint:
                print(f'Not implemented... (Timespan)')
            return NotImplemented

    def __repr__(self):
        return "{}-{}-{} -> {}-{}-{}".format(
            self.start_date.year, self.start_date.month, self.start_date.day,
            self.end_date.year, self.end_date.month, self.end_date.day)
BTW, if you want to use operators other than == (such as !=, <, >, <=, >=), you can implement the corresponding methods (__ne__, __lt__, __gt__, __le__, __ge__).
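As a rough sketch of what such an extra operator could look like, the snippet below defines `__lt__` on a point so that `point < span` means "the point lies strictly before the span starts". That meaning is my own assumption for illustration, and `Point` and `Span` are hypothetical, stripped-down stand-ins for the classes above, so that the snippet runs on its own.

```python
from datetime import datetime
import numpy as np

class Span:
    """Hypothetical stand-in for Timespan, reduced to its two dates."""
    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date

class Point:
    """Hypothetical stand-in for PointInTime."""
    def __init__(self, year, month, day):
        self.dt = datetime(year, month, day)

    def __lt__(self, other):
        # Assumed meaning: the point lies strictly before the span starts.
        if isinstance(other, Span):
            return self.dt < other.start_date
        if isinstance(other, Point):
            return self.dt < other.dt
        return NotImplemented

points = np.array([Point(2015, 1, 1), Point(2015, 3, 3)], dtype=object)
spans = np.array([Span(datetime(2015, 2, 1), datetime(2015, 2, 5))], dtype=object)

# On object arrays, numpy applies < element by element, calling __lt__.
before = points[:, None] < spans
print(before.tolist())
```

The same broadcasting pattern as for == then applies: pass the resulting mask to np.where to get the matching index pairs.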
The way you can use this in combination with the broadcasting is as follows.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "pit": [PointInTime(2015,1,1), PointInTime(2015,2,2), PointInTime(2015,3,3), PointInTime(2015,4,4)],
    'vals1': [1,2,3,4],
})
df2 = pd.DataFrame({
    "ts": [Timespan(datetime(2015,2,1), datetime(2015,2,5)),
           Timespan(datetime(2015,2,1), datetime(2015,4,1)),
           Timespan(datetime(2015,2,1), datetime(2015,2,5))],
    'vals2': ['a', 'b', 'c'],
})

a = df1['pit'].values
b = df2['ts'].values
i, j = np.where((a[:,None] == b))

res = pd.DataFrame(
    np.column_stack([df1.values[i], df2.values[j]]),
    columns=df1.columns.append(df2.columns)
)

print(df1)
print(df2)
print(res)
This gives the output as expected.
<class '__main__.PointInTime'>: comparing 2015-1-1 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-1-1 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-1-1 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-2-2 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-2-2 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-2-2 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-3-3 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-3-3 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-3-3 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-4-4 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-4-4 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-4-4 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
        pit  vals1
0  2015-1-1      1
1  2015-2-2      2
2  2015-3-3      3
3  2015-4-4      4
                     ts vals2
0  2015-2-1 -> 2015-2-5     a
1  2015-2-1 -> 2015-4-1     b
2  2015-2-1 -> 2015-2-5     c
        pit vals1                    ts vals2
0  2015-2-2     2  2015-2-1 -> 2015-2-5     a
1  2015-2-2     2  2015-2-1 -> 2015-4-1     b
2  2015-2-2     2  2015-2-1 -> 2015-2-5     c
3  2015-3-3     3  2015-2-1 -> 2015-4-1     b
The classes probably add some performance overhead compared to basic Python types, but I have not looked into that.
The above is how we create the "inner" join. It should be straightforward to create the "(outer) left", "(outer) right" and "(full) outer" joins.
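For instance, a left join could be sketched as follows. This is only an illustration under simplifying assumptions: it uses made-up integer "points" and (start, end) columns instead of the classes above, and it builds the result with pd.concat rather than the np.column_stack call used earlier. The idea is to take the inner-join rows and append the rows of df1 whose index never appears in i.

```python
import numpy as np
import pandas as pd

# Illustrative data: integer "points" and (start, end) "intervals".
df1 = pd.DataFrame({"pit": [1, 5, 9], "vals1": [10, 20, 30]})
df2 = pd.DataFrame({"start": [4, 0], "end": [6, 7], "vals2": ["a", "b"]})

# Same broadcasting idea as above, with a strict "inside the interval" test.
a = df1["pit"].values
mask = (a[:, None] > df2["start"].values) & (a[:, None] < df2["end"].values)
i, j = np.where(mask)

# Inner join: one output row per matching (i, j) pair.
inner = pd.concat(
    [df1.iloc[i].reset_index(drop=True), df2.iloc[j].reset_index(drop=True)],
    axis=1,
)

# Left join: append the df1 rows that matched nothing; concat fills the
# df2 columns of those rows with NaN.
unmatched = df1.drop(df1.index[np.unique(i)])
left = pd.concat([inner, unmatched], ignore_index=True)
print(left)
```

A right join is the mirror image (append the df2 rows whose index never appears in j), and a full outer join appends both sets of unmatched rows.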