Pandas Merge on Name and Closest Date
Asked Answered
S

3

25

I am trying to merge two dataframes on both name and the closest date (WRT the left hand dataframe). In my research I found one similar question here but it doesn't account for the name as well. From the above question it doesn't seem like there is a way to do this with merge but I can't see another way to do the two argument join that doesn't use the pandas merge function.

Is there a way to do this with merge? And if not what would be the appropriate way to do this?

I will post a copy of what I have tried but this was trying it with an exact merge on date which will not work. The most important line is the last one where I make the data3 dataframe.

data=pd.read_csv("edgar14Afacts.csv", parse_dates={"dater": [2]}, infer_datetime_format=True)
data2=pd.read_csv("sdcmergersdata.csv", parse_dates={"dater": [17]}, infer_datetime_format=True)
list(data2.columns.values)

data2.rename(columns=lambda x: x.replace('\r\n', ''), inplace=True)
data2.rename(columns=lambda x: x.replace('\n', ''), inplace=True)
data2.rename(columns=lambda x: x.replace('\r', ''), inplace=True)
data2=data2.rename(columns = {'Acquiror Name':'name'})
data2=data2.rename(columns = {'dater':'date'})
data=data.rename(columns = {'dater':'date'})

list(data2.columns.values)

data["name"]=data['name'].map(str.lower)
data2["name"]=data2['name'].map(str.lower)
data2['date'].fillna(method='pad')
data['namer1']=data['name']
data['dater1']=data['date']
data2['namer2']=data2['name']
data2['dater2']=data2['date']

print data.head()
print data2.head()
data['name'] = data['name'].map(lambda x: str(x)[:4])
data2['name'] = data2['name'].map(lambda x: str(x)[:4])

data3 = pd.merge(data, data2, how='left', on=['date','name'])
data3.to_csv("check.csv")
Sissel answered 7/7, 2014 at 15:39 Comment(6)
You won't be able to merge using a partial match, you'd have to merge what you can and then perform a lookup for the other rows, I've done this before where there were inexact matches. You have to write some function and then apply it row-wise to your merged dataframeHung
Great. I would love to see that code if you still have it laying around. As a side note, I think its sort of odd that pandas doesn't have an implementation of this as this seems like a fairly common thing that someone might want to do. Or maybe that's just me..Sissel
I was kinda implying that you'd need to try to write the function yourself, basically the elements are 1. does a date exist, if not then do a searchsorted call on the data to find where if I was to insert the date, what the index value would be, then use this index value plus the previous value to see which is closest and return it.Hung
Haha that's pretty much what I did. I merged it but then instead of a searchsorted call I did some tricky masking to make it find the closest date. That did the trick. It did take a long time though. I sure still would have liked to have a way to do it in merge. :)Sissel
There's nothing built in I'm afraid, I used searchsorted as this is very fast on a numpy array but the key is to trim the data down to only those that need the lookup as iterating over a dataframe is slow so you don't want to do this unnecessarily, you should post your answer and then accept so this answer doesn't stay unanswered.Hung
Ok will do. I am still ironing out a few kinks in the code but I will post it when I get them fixed.Sissel
B
7

I'd also love to see the final solution you came up with to know how it shook out in the end.

One thing you can do to find the closest date might be something to calc the number of days between each date in the first DataFrame and the dates in the second DataFrame. Then you can use np.argmin to retrieve the date with the smallest time delta.

For example:

Setup

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
import numpy as np
import pandas as pd
from pandas.io.parsers import StringIO

Data

a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""

b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""

look at data

df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df1.head()

   timepoint  measure
0 2014-01-01       78
1 2014-01-02       29
2 2014-01-03        5
3 2014-01-04       73
4 2014-01-05       40

df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df2.head()

   timepoint  measure
0 2014-01-01       78
1 2014-01-08       29
2 2014-01-15        5
3 2014-01-22       73
4 2014-01-29       40

Func to find the closest date to a given date

def find_closest_date(timepoint, time_series, add_time_delta_column=True):
    # takes a pd.Timestamp() instance and a pd.Series with dates in it
    # calcs the delta between `timepoint` and each date in `time_series`
    # returns the closest date and optionally the number of days in its time delta
    deltas = np.abs(time_series - timepoint)
    idx_closest_date = np.argmin(deltas)
    res = {"closest_date": time_series.ix[idx_closest_date]}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = deltas[idx_closest_date]
        idx.append('closest_delta')
    return pd.Series(res, index=idx)

df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply(
                                          find_closest_date, args=[df2.timepoint])
df1.head(10)

   timepoint  measure    closest  days_bt_x_and_y
0 2014-01-01       78 2014-01-01           0 days
1 2014-01-02       29 2014-01-01           1 days
2 2014-01-03        5 2014-01-01           2 days
3 2014-01-04       73 2014-01-01           3 days
4 2014-01-05       40 2014-01-08           3 days
5 2014-01-06       45 2014-01-08           2 days
6 2014-01-07       48 2014-01-08           1 days
7 2014-01-08        2 2014-01-08           0 days
8 2014-01-09       96 2014-01-08           1 days
9 2014-01-10       82 2014-01-08           2 days

Merge the two DataFrames on the new closest date column

df3 = pd.merge(df1, df2, left_on=['closest'], right_on=['timepoint'])

colorder = [
    'timepoint_x',
    'closest',
    'timepoint_y',
    'days_bt_x_and_y',
    'measure_x',
    'measure_y'
]

df3 = df3.ix[:, colorder]
df3

   timepoint_x    closest timepoint_y  days_bt_x_and_y  measure_x  measure_y
0   2014-01-01 2014-01-01  2014-01-01           0 days         78         78
1   2014-01-02 2014-01-01  2014-01-01           1 days         29         78
2   2014-01-03 2014-01-01  2014-01-01           2 days          5         78
3   2014-01-04 2014-01-01  2014-01-01           3 days         73         78
4   2014-01-05 2014-01-08  2014-01-08           3 days         40         29
5   2014-01-06 2014-01-08  2014-01-08           2 days         45         29
6   2014-01-07 2014-01-08  2014-01-08           1 days         48         29
7   2014-01-08 2014-01-08  2014-01-08           0 days          2         29
8   2014-01-09 2014-01-08  2014-01-08           1 days         96         29
9   2014-01-10 2014-01-08  2014-01-08           2 days         82         29
10  2014-01-11 2014-01-08  2014-01-08           3 days         61         29
11  2014-01-12 2014-01-15  2014-01-15           3 days         68          5
12  2014-01-13 2014-01-15  2014-01-15           2 days          8          5
13  2014-01-14 2014-01-15  2014-01-15           1 days         94          5
14  2014-01-15 2014-01-15  2014-01-15           0 days         16          5
15  2014-01-16 2014-01-15  2014-01-15           1 days         31          5
16  2014-01-17 2014-01-15  2014-01-15           2 days         10          5
17  2014-01-18 2014-01-15  2014-01-15           3 days         34          5
18  2014-01-19 2014-01-22  2014-01-22           3 days         27         73
19  2014-01-20 2014-01-22  2014-01-22           2 days         58         73
20  2014-01-21 2014-01-22  2014-01-22           1 days         90         73
21  2014-01-22 2014-01-22  2014-01-22           0 days         41         73
22  2014-01-23 2014-01-22  2014-01-22           1 days         97         73
23  2014-01-24 2014-01-22  2014-01-22           2 days          7         73
24  2014-01-25 2014-01-22  2014-01-22           3 days         86         73
25  2014-01-26 2014-01-29  2014-01-29           3 days         62         40
26  2014-01-27 2014-01-29  2014-01-29           2 days         91         40
27  2014-01-28 2014-01-29  2014-01-29           1 days          0         40
28  2014-01-29 2014-01-29  2014-01-29           0 days         73         40
29  2014-01-30 2014-01-29  2014-01-29           1 days         22         40
30  2014-01-31 2014-01-29  2014-01-29           2 days         43         40
31  2014-02-01 2014-01-29  2014-01-29           3 days         87         40
32  2014-02-02 2014-02-05  2014-02-05           3 days         56         45
33  2014-02-03 2014-02-05  2014-02-05           2 days         45         45
34  2014-02-04 2014-02-05  2014-02-05           1 days         25         45
35  2014-02-05 2014-02-05  2014-02-05           0 days         92         45
36  2014-02-06 2014-02-05  2014-02-05           1 days         83         45
37  2014-02-07 2014-02-05  2014-02-05           2 days         13         45
38  2014-02-08 2014-02-05  2014-02-05           3 days         50         45
39  2014-02-09 2014-02-12  2014-02-12           3 days         48         48
40  2014-02-10 2014-02-12  2014-02-12           2 days         78         48
Bores answered 21/9, 2014 at 18:10 Comment(0)
T
37

This is super late, but hopefully its helpful for new answer seekers. I answered a similar question here

with a somewhat new method in pandas:

pandas.merge_asof()

The parameters of interest for you would be direction,tolerance,left_on, and right_on

Building off @hernamesbarbara answer & data:

data

a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""

b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""

solution

import pandas as pd
from pandas import read_csv
from io import StringIO

df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])



df1['timepoint'] = pd.to_datetime(df1['timepoint'])
df2['timepoint'] = pd.to_datetime(df2['timepoint'])

# converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
df1.index = df1['timepoint']
df2.index = df2['timepoint']
# the magic happens below, check the direction and tolerance arguments
# if you want you can make a maximum tolerance on which to merge data
tol = pd.Timedelta('3 day')
df3 = pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)

output

df3.head()

    timepoint_x measure_x   timepoint_y measure_y
timepoint               
2014-01-01  2014-01-01  78  2014-01-01  78
2014-01-02  2014-01-02  29  2014-01-01  78
2014-01-03  2014-01-03  5   2014-01-01  78
2014-01-04  2014-01-04  73  2014-01-01  78
2014-01-05  2014-01-05  40  2014-01-08  29
Thermion answered 18/7, 2018 at 14:25 Comment(3)
can you use tolerance to find the nearest? Regardless of how far away it might be?Ohaus
tolerance is an optional parameter, so you should leave it blank and set direction='nearest' for that @MonicaHeddneck. More info here: pandas.pydata.org/docs/reference/api/pandas.merge_asof.htmlThermion
I think this doesn't answer the question. Because what is needed here is to do a merge between the closest date and name, so it's a double merge, not only considering the dateArmlet
B
7

I'd also love to see the final solution you came up with to know how it shook out in the end.

One thing you can do to find the closest date might be something to calc the number of days between each date in the first DataFrame and the dates in the second DataFrame. Then you can use np.argmin to retrieve the date with the smallest time delta.

For example:

Setup

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
import numpy as np
import pandas as pd
from pandas.io.parsers import StringIO

Data

a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""

b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""

look at data

df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df1.head()

   timepoint  measure
0 2014-01-01       78
1 2014-01-02       29
2 2014-01-03        5
3 2014-01-04       73
4 2014-01-05       40

df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df2.head()

   timepoint  measure
0 2014-01-01       78
1 2014-01-08       29
2 2014-01-15        5
3 2014-01-22       73
4 2014-01-29       40

Func to find the closest date to a given date

def find_closest_date(timepoint, time_series, add_time_delta_column=True):
    # takes a pd.Timestamp() instance and a pd.Series with dates in it
    # calcs the delta between `timepoint` and each date in `time_series`
    # returns the closest date and optionally the number of days in its time delta
    deltas = np.abs(time_series - timepoint)
    idx_closest_date = np.argmin(deltas)
    res = {"closest_date": time_series.ix[idx_closest_date]}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = deltas[idx_closest_date]
        idx.append('closest_delta')
    return pd.Series(res, index=idx)

df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply(
                                          find_closest_date, args=[df2.timepoint])
df1.head(10)

   timepoint  measure    closest  days_bt_x_and_y
0 2014-01-01       78 2014-01-01           0 days
1 2014-01-02       29 2014-01-01           1 days
2 2014-01-03        5 2014-01-01           2 days
3 2014-01-04       73 2014-01-01           3 days
4 2014-01-05       40 2014-01-08           3 days
5 2014-01-06       45 2014-01-08           2 days
6 2014-01-07       48 2014-01-08           1 days
7 2014-01-08        2 2014-01-08           0 days
8 2014-01-09       96 2014-01-08           1 days
9 2014-01-10       82 2014-01-08           2 days

Merge the two DataFrames on the new closest date column

df3 = pd.merge(df1, df2, left_on=['closest'], right_on=['timepoint'])

colorder = [
    'timepoint_x',
    'closest',
    'timepoint_y',
    'days_bt_x_and_y',
    'measure_x',
    'measure_y'
]

df3 = df3.ix[:, colorder]
df3

   timepoint_x    closest timepoint_y  days_bt_x_and_y  measure_x  measure_y
0   2014-01-01 2014-01-01  2014-01-01           0 days         78         78
1   2014-01-02 2014-01-01  2014-01-01           1 days         29         78
2   2014-01-03 2014-01-01  2014-01-01           2 days          5         78
3   2014-01-04 2014-01-01  2014-01-01           3 days         73         78
4   2014-01-05 2014-01-08  2014-01-08           3 days         40         29
5   2014-01-06 2014-01-08  2014-01-08           2 days         45         29
6   2014-01-07 2014-01-08  2014-01-08           1 days         48         29
7   2014-01-08 2014-01-08  2014-01-08           0 days          2         29
8   2014-01-09 2014-01-08  2014-01-08           1 days         96         29
9   2014-01-10 2014-01-08  2014-01-08           2 days         82         29
10  2014-01-11 2014-01-08  2014-01-08           3 days         61         29
11  2014-01-12 2014-01-15  2014-01-15           3 days         68          5
12  2014-01-13 2014-01-15  2014-01-15           2 days          8          5
13  2014-01-14 2014-01-15  2014-01-15           1 days         94          5
14  2014-01-15 2014-01-15  2014-01-15           0 days         16          5
15  2014-01-16 2014-01-15  2014-01-15           1 days         31          5
16  2014-01-17 2014-01-15  2014-01-15           2 days         10          5
17  2014-01-18 2014-01-15  2014-01-15           3 days         34          5
18  2014-01-19 2014-01-22  2014-01-22           3 days         27         73
19  2014-01-20 2014-01-22  2014-01-22           2 days         58         73
20  2014-01-21 2014-01-22  2014-01-22           1 days         90         73
21  2014-01-22 2014-01-22  2014-01-22           0 days         41         73
22  2014-01-23 2014-01-22  2014-01-22           1 days         97         73
23  2014-01-24 2014-01-22  2014-01-22           2 days          7         73
24  2014-01-25 2014-01-22  2014-01-22           3 days         86         73
25  2014-01-26 2014-01-29  2014-01-29           3 days         62         40
26  2014-01-27 2014-01-29  2014-01-29           2 days         91         40
27  2014-01-28 2014-01-29  2014-01-29           1 days          0         40
28  2014-01-29 2014-01-29  2014-01-29           0 days         73         40
29  2014-01-30 2014-01-29  2014-01-29           1 days         22         40
30  2014-01-31 2014-01-29  2014-01-29           2 days         43         40
31  2014-02-01 2014-01-29  2014-01-29           3 days         87         40
32  2014-02-02 2014-02-05  2014-02-05           3 days         56         45
33  2014-02-03 2014-02-05  2014-02-05           2 days         45         45
34  2014-02-04 2014-02-05  2014-02-05           1 days         25         45
35  2014-02-05 2014-02-05  2014-02-05           0 days         92         45
36  2014-02-06 2014-02-05  2014-02-05           1 days         83         45
37  2014-02-07 2014-02-05  2014-02-05           2 days         13         45
38  2014-02-08 2014-02-05  2014-02-05           3 days         50         45
39  2014-02-09 2014-02-12  2014-02-12           3 days         48         48
40  2014-02-10 2014-02-12  2014-02-12           2 days         78         48
Bores answered 21/9, 2014 at 18:10 Comment(0)
C
1

A small addition to the hernamesbarbara's code

def find_closest_date(timepoint, time_series, add_time_delta_column=True, mode="abs"):
    """takes a pd.Timestamp() instance and a pd.Series with dates in it
    calcs the delta between `timepoint` and each date in `time_series`
    returns the closest date and optionally the number of days in its time delta

    Parameters
    ----------
    mode: "abs" (default), "left", "right"
        closest datetime by abs, at left, at right

    References
    ----------
    .. [1] https://mcmap.net/q/528593/-pandas-merge-on-name-and-closest-date
    """
    deltas = time_series - timepoint

    idx_closest_date = None
    if mode == "abs":
        idx_closest_date = np.argmin(abs(deltas))
    elif mode == "left":
        deltas_ = deltas[deltas <= pd.Timedelta('0 days 00:00:00.0')]
        if len(deltas_):
            idx_closest_date = np.argmax(deltas_)
    elif mode == "right":
        deltas_ = deltas[deltas >= pd.Timedelta('0 days 00:00:00.0')]
        if len(deltas_):
            idx_closest_date = np.argmin(deltas_)
    else:
        raise Exception("Mode is incorrect")

    if idx_closest_date is not None:
        closest_date = time_series.ix[idx_closest_date]
        if add_time_delta_column:
            closest_delta = deltas[idx_closest_date]
    else:
        closest_date = pd.NaT
        if add_time_delta_column:
            closest_delta = pd.Timedelta(pd.NaT)

    res = {"closest_date": closest_date}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = closest_delta
        idx.append('closest_delta')

    return pd.Series(res, index=idx)
Choplogic answered 22/11, 2016 at 9:22 Comment(1)
df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply( find_closest_date, args=[df2.timepoint], mode="abs")Choplogic

© 2022 - 2024 — McMap. All rights reserved.