How to merge two data frames based on nearest date

I want to merge two data frames based on two columns: "Code" and "Date". Merging on "Code" is straightforward, but "Date" is tricky: there is no exact match between the dates in df1 and df2, so I want to select the closest dates. How can I do this?

df = df1[column_names1].merge(df2[column_names2], on='Code')
Melitamelitopol answered 29/10, 2015 at 18:6 Comment(3)
related question: #24614974Flintlock
Here is a better answer: #21202118Flintlock
@jakevdp: Thanks, but how can I combine it with merging by Code? Should I first use "searchsorted" and then put "mask = idx >= 0 &..."?Melitamelitopol
F
8

I don't think there's a quick, one-line way to do this kind of thing, but I believe the best approach is to do it this way:

  1. add a column to df1 with the closest date from the appropriate group in df2

  2. call a standard merge on these

As the size of your data grows, this "closest date" operation can become rather expensive unless you do something sophisticated. I like to use scikit-learn's NearestNeighbors class for this sort of thing.

I've put together one approach to that solution that should scale relatively well. First we can generate some simple data:

import pandas as pd
import numpy as np
dates = pd.date_range('2015', periods=200, freq='D')

rand = np.random.RandomState(42)
i1 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
i2 = np.sort(rand.permutation(np.arange(len(dates)))[:5])

df1 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                    'Date': dates[i1],
                    'val1':rand.rand(5)})
df2 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                    'Date': dates[i2],
                    'val2':rand.rand(5)})

Let's check these out:

>>> df1
   Code       Date      val1
0     0 2015-01-16  0.975852
1     0 2015-01-31  0.516300
2     1 2015-04-06  0.322956
3     1 2015-05-09  0.795186
4     1 2015-06-08  0.270832

>>> df2
   Code       Date      val2
0     1 2015-02-03  0.184334
1     1 2015-04-13  0.080873
2     0 2015-05-02  0.428314
3     1 2015-06-26  0.688500
4     0 2015-06-30  0.058194

Now let's write an apply function that adds a column of nearest dates to df1 using scikit-learn:

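from sklearn.neighbors import NearestNeighbors

def find_nearest(group, match, groupname):
    # Keep only the rows of `match` that belong to the current group
    match = match[match[groupname] == group.name]

    # Represent dates as integer nanoseconds so the distances are purely numeric
    match_dates = match['Date'].values.astype('int64')[:, None]
    group_dates = group['Date'].values.astype('int64')[:, None]

    # n_neighbors must be passed by keyword in recent scikit-learn releases
    nbrs = NearestNeighbors(n_neighbors=1).fit(match_dates)
    dist, ind = nbrs.kneighbors(group_dates)

    # Work on a copy so the original group is not modified in place
    group = group.copy()
    group['Date1'] = group['Date']
    group['Date'] = match['Date'].values[ind.ravel()]
    return group

df1_mod = df1.groupby('Code').apply(find_nearest, df2, 'Code')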
>>> df1_mod
   Code       Date      val1      Date1
0     0 2015-05-02  0.975852 2015-01-16
1     0 2015-05-02  0.516300 2015-01-31
2     1 2015-04-13  0.322956 2015-04-06
3     1 2015-04-13  0.795186 2015-05-09
4     1 2015-06-26  0.270832 2015-06-08

Finally, we can merge these together with a straightforward call to pd.merge:

>>> pd.merge(df1_mod, df2, on=['Code', 'Date'])
   Code       Date      val1      Date1      val2
0     0 2015-05-02  0.975852 2015-01-16  0.428314
1     0 2015-05-02  0.516300 2015-01-31  0.428314
2     1 2015-04-13  0.322956 2015-04-06  0.080873
3     1 2015-04-13  0.795186 2015-05-09  0.080873
4     1 2015-06-26  0.270832 2015-06-08  0.688500

Notice that rows 0 and 1 both matched the same val2; this is expected given the way you described your desired solution.
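
As a hedged alternative to the scikit-learn approach above (a different technique, not the one used in this answer): recent pandas versions provide pd.merge_asof, which can match on the nearest key within groups via its by and direction='nearest' parameters. The sketch below uses small made-up frames named left and right; the extra Date2 column is an assumption added only so the matched date stays visible.

```python
import pandas as pd

# Small illustrative frames (values are made up for this sketch)
left = pd.DataFrame({
    'Code': [0, 0, 1, 1],
    'Date': pd.to_datetime(['2015-01-16', '2015-01-31',
                            '2015-04-06', '2015-05-09']),
    'val1': [1.0, 2.0, 3.0, 4.0],
})
right = pd.DataFrame({
    'Code': [1, 1, 0, 1],
    'Date': pd.to_datetime(['2015-02-03', '2015-04-13',
                            '2015-05-02', '2015-06-26']),
    'val2': [10.0, 20.0, 30.0, 40.0],
})

# Copy the right-hand date into its own column, because merge_asof
# keeps only the left frame's key column in the result
right = right.assign(Date2=right['Date'])

# merge_asof requires both frames to be sorted by the `on` key
nearest = pd.merge_asof(
    left.sort_values('Date'),
    right.sort_values('Date'),
    on='Date',            # match on the nearest date ...
    by='Code',            # ... but only among rows with the same Code
    direction='nearest',
)
print(nearest)
```

As in the merge above, several left rows may match the same right row when it is the closest candidate for each of them.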

Flintlock answered 30/10, 2015 at 14:28 Comment(4)
Not related to this question but your PYCON 2015 SKLEARN was very nice. Much appreciated for sharing!Wondrous
I have a problem with dates in the format 2015-10-19T07:42:00.000. Do you have any idea how to parse them so that the code works?Melitamelitopol
You can convert from string to date with pd.to_datetime()Flintlock
Is there a way to modify find_nearest so that it takes fewer parameters and matches on date only?Misdeal
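
For the date-format question in the comments, a minimal sketch of the pd.to_datetime conversion might look like this. The frame and its second value are hypothetical; only the first string comes from the comment.

```python
import pandas as pd

# Hypothetical frame with dates stored as ISO 8601 strings
df = pd.DataFrame({'Date': ['2015-10-19T07:42:00.000',
                            '2015-10-20T18:05:30.000']})

# Parse the strings into datetime64 values so date arithmetic works
df['Date'] = pd.to_datetime(df['Date'])
print(df.dtypes)
```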
T
0

Here's an alternative solution:

  1. Merge on Code.

  2. Add a date difference column according to your need (I used abs in the example below) and sort the data using the new column.

  3. Group by the records of the first data frame and for each group take a record from the second data frame with the closest date.

Code:

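# Keep the original df1 row number ('index') so it can be grouped on later;
# column_names1 / column_names2 are lists that include 'Code' and 'Date'
df = df1.reset_index()[['index'] + column_names1].merge(
    df2[column_names2], on='Code', suffixes=('1', '2'))  # Date -> Date1, Date2
df['DateDiff'] = (df['Date1'] - df['Date2']).abs()
df = df.sort_values('DateDiff').groupby('index').first().reset_index()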
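
A self-contained sketch of the three steps above, using made-up data. It assumes both frames have a column named Date, so the suffixes argument is what produces Date1 and Date2.

```python
import pandas as pd

# Illustrative data (values are made up for this sketch)
df1 = pd.DataFrame({
    'Code': [0, 1, 1],
    'Date': pd.to_datetime(['2015-01-16', '2015-04-06', '2015-05-09']),
    'val1': [1.0, 2.0, 3.0],
})
df2 = pd.DataFrame({
    'Code': [1, 1, 0, 0],
    'Date': pd.to_datetime(['2015-02-03', '2015-04-13',
                            '2015-05-02', '2015-01-10']),
    'val2': [10.0, 20.0, 30.0, 40.0],
})

# 1. Merge on Code; the suffixes turn the two 'Date' columns into Date1 / Date2
df = df1.reset_index().merge(df2, on='Code', suffixes=('1', '2'))

# 2. Absolute date difference for every candidate pair
df['DateDiff'] = (df['Date1'] - df['Date2']).abs()

# 3. For each original df1 row, keep the candidate with the smallest difference
result = df.sort_values('DateDiff').groupby('index').first().reset_index()
print(result)
```

Note that step 1 builds every pair of rows sharing a Code, so the intermediate frame can grow large when each Code has many rows in both frames.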
Tragus answered 18/9, 2016 at 17:7 Comment(0)
