I have two pandas dataframes I want to join/merge exactly on a number of columns (say 3) and approximately, i.e nearest neighbour, on one (date) column. I also want to return the difference (days) between them. Each dataset is about 50,000 rows long. I'm most interested in an inner join, but the “leftovers” are also interesting if not too hard to get hold of. Most of the “exact match” observations will exist multiple times in each data frame.
I've been trying to use difflib.get_close_matches on the concatenation of all of them as strings (which is stupid, I know!) but is does not always give exact matches. I suppose I need to loop over the exact matches first and then find nearest matches within this group, but I just can't seem to get it right...
The dataframes look something like:
df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')
df1
Out[430]:
col1 col2 col3 date
index
a1 1232 asd 1 2010-01-23
a2 432 dsa12 2 2016-05-20
a3 432 dsa12 2 2010-06-20
a4 123 asd2 3 2008-10-21
df2 = pd.DataFrame({'index': ['b1','b2','b3','b4'], 'col1': ['132','432','432','123'], 'col2': ['asd','dsa12','dsa12','sd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-23','2010-06-10','2008-10-21'],}).set_index('index')
df2
Out[434]:
col1 col2 col3 date
index
b1 132 asd 1 2010-01-23
b2 432 dsa12 2 2016-05-23
b3 432 dsa12 2 2010-06-10
b4 123 sd2 3 2008-10-21
In the end I want something like:
col1 col2 col3 date diff match_index
index
a1 1232 asd 1 2010-01-23 nan nan
a2 432 dsa12 2 2016-05-20 -3 b2
a3 432 dsa12 2 2010-06-20 10 b3
a4 123 asd2 3 2008-10-21 nan nan
a5 123 sd2 3 2008-10-21 nan b4
or if it's way easier with just an inner join I'd like:
col1 col2 col3 date diff match_index
index
a2 432 dsa12 2 2016-05-20 -3 b2
a3 432 dsa12 2 2010-06-20 10 b3