Pandas: Approximate join on one column, exact match on other columns
Asked Answered
S

1

6

I have two pandas dataframes I want to join/merge exactly on a number of columns (say 3) and approximately, i.e nearest neighbour, on one (date) column. I also want to return the difference (days) between them. Each dataset is about 50,000 rows long. I'm most interested in an inner join, but the “leftovers” are also interesting if not too hard to get hold of. Most of the “exact match” observations will exist multiple times in each data frame.

I've been trying to use difflib.get_close_matches on the concatenation of all of them as strings (which is stupid, I know!) but is does not always give exact matches. I suppose I need to loop over the exact matches first and then find nearest matches within this group, but I just can't seem to get it right...

The dataframes look something like:

df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')

df1
Out[430]: 
       col1   col2 col3        date
index                              
a1     1232    asd    1  2010-01-23
a2      432  dsa12    2  2016-05-20
a3      432  dsa12    2  2010-06-20
a4      123   asd2    3  2008-10-21

df2 = pd.DataFrame({'index': ['b1','b2','b3','b4'], 'col1': ['132','432','432','123'], 'col2': ['asd','dsa12','dsa12','sd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-23','2010-06-10','2008-10-21'],}).set_index('index')

df2
Out[434]: 
      col1   col2 col3        date
index                             
b1     132    asd    1  2010-01-23
b2     432  dsa12    2  2016-05-23
b3     432  dsa12    2  2010-06-10
b4     123    sd2    3  2008-10-21

In the end I want something like:

       col1   col2 col3        date diff match_index
index                              
a1     1232    asd    1  2010-01-23  nan         nan
a2      432  dsa12    2  2016-05-20   -3          b2
a3      432  dsa12    2  2010-06-20   10          b3
a4      123   asd2    3  2008-10-21  nan         nan
a5      123    sd2    3  2008-10-21  nan          b4

or if it's way easier with just an inner join I'd like:

       col1   col2 col3        date diff match_index
index                                                     
a2      432  dsa12    2  2016-05-20   -3          b2
a3      432  dsa12    2  2010-06-20   10          b3
Shimmery answered 31/5, 2016 at 11:37 Comment(1)
It might make sense to ask separate questions for each of the things you're looking for.Yusem
G
2

I am not sure if this suits. It achieves more or less what you want but does not actually perform a merge. It follows the same idea as this question except instead of subsetting the df1 based on only one column, here we match on multiple columns using a groupby and do it on both dataframes. IF you do want to explicitly include the merge command and are happy with an inner join then check the very bottom of the answer, it includes a snippet for that.

import pandas as pd
from sklearn.neighbors import NearestNeighbors

    def find_nearest(group, df2, groupname):
        try:
            match = df2.groupby(groupname).get_group(group.name)
            match['date'] = pd.to_datetime(match.date, unit = 'D')
            nbrs = NearestNeighbors(1).fit(match['date'].values[:, None])
            dist, ind = nbrs.kneighbors(group['date'].values[:, None])

            group['date1'] = group['date']
            group['date'] = match['date'].values[ind.ravel()]
            group['diff'] = (group['date1']-group['date'])
            group['match_index'] = match.index[ind.ravel()]
            return group
        except KeyError:
            return group

    #change dates from string to datetime
    df1['date'] = pd.to_datetime(df1.date, unit = 'D')
    df2['date'] = pd.to_datetime(df2.date, unit = 'D')

    #find closest dates and differences
    keys = ['col1', 'col2', 'col3']
    df1_mod = df1.groupby(keys).apply(find_nearest, df2, keys)

    #fill unmatched dates 
    df1_mod.date1.fillna(df1_mod.date, inplace=True)

    df2_mod = df2.groupby(keys).apply(find_nearest, df1, keys) 
    df2_mod.date1.fillna(df2_mod.date, inplace=True)

    #drop original column 
    df1_mod.drop('date', inplace=True, axis=1)
    df1_mod.rename(columns = {'date1':'date'}, inplace=True)

    df2_mod.drop('date', inplace=True, axis=1)
    df2_mod.rename(columns = {'date1':'date'}, inplace=True)
    df2_mod['diff'] = -df2_mod['diff']

    #drop redundant values
    df2_mod.drop(df2_mod[df2_mod.match_index.str.len()>0].index, inplace=True)

    #merge the two 
    df_final = pd.merge(df1_mod, df2_mod, how='outer')

This yields the following result:

In [349]: df_final
Out[349]:
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1   432  dsa12    2 2016-05-20 -3 days          b2
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN

With the merge command:

In [208]: pd.merge(df1_mod, df2.drop('date', axis=1), on=['col1', 'col2', 'col3']).drop_duplicates()
Out[208]:
  col1   col2 col3       date    diff match_index
0  432  dsa12    2 2016-05-20 -3 days          b2
2  432  dsa12    2 2010-06-20 10 days          b3

The case considered in the comments, namely:

df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','1432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')

yields the following:

In [351]: df_final
Out[351]:
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1  1432  dsa12    2 2016-05-20     NaT         NaN
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN
Gar answered 31/5, 2016 at 15:29 Comment(11)
Thanks Gene! This is almost it, but it matches on only 1 columncol3 whilst I want to match on all of col1,col2 and col3. I tried concatenating them into one using df1['matchCol'] = df1['col1'].astype(str) + df1['col2'].astype(str) + df1['col3'].astype(str) but it wouldn't have it... it seems it only works when the exact match is a numeric value, not a string?Shimmery
sorry it has nothing to do with numeric vs strings... still not sure why it won't work, though, it gives a ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.Shimmery
CAR ERL, I don't quite follow. If possible, could you provide the code snipped that you're running and yields the aforementioned error message?Gar
Sorry if I was unclear Gene. If you change the first line to: df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','1432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index') you'll notice df_finalwill still show two matches; but the first match shoult not be a match because the col1 values no longer matches; only col2 and col3 columns match.Shimmery
Just noticed that the pd.merge command you mentioned at the end does work nicely; but then I only get the inner joins; is there a way to add all non-matches from df1 and df2 onto this dataset too? I've been trying different pd.merge versions but just can't seem to get it straight...Shimmery
Gene, I got it working using the merge version; updated your answer to include the full working example I used. Thanks HEAPS!Shimmery
@CARERL, thanks mate. If this suffices, would you mind marking it as the answer? On that note, let me see if I can make the outer join hack work for youGar
to have the unmatched observations stay after the merge try adding the how='outer' argument to the pd.merge(), that'll return also return the unmatched values in columns col1, col2, and col3Gar
Let us continue this discussion in chat.Shimmery
Is there an easier version of implementing this in now, especially since pandas.merge_asof has been introduced (I think it was shortly after this post?).Spathe
Creating a new post for this here: #64356176Spathe

© 2022 - 2024 — McMap. All rights reserved.