Given a value from a pandas column DataFrame, select N rows above and below to that closest value in other DataFrame

Asked 27/2, 2024 at 12:15 Answered 15/8, 2024 at 13:28

I have two pandas DataFrames:

import pandas as pd

data1 = {
    'score': [1, 2],
    'seconds': [1140, 2100],
}

data2 = {
    'prize': [5.5, 14.5, 14.6, 21, 23, 24, 26, 38, 39, 40, 50],
    'seconds': [840, 1080, 1380, 1620, 1650, 1680, 1700, 1740, 2040, 2100, 2160],
}

df1 = pd.DataFrame.from_dict(data1)
df2 = pd.DataFrame.from_dict(data2)

Output: df1
   score  seconds
0      1     1140
1      2     2100

Output: df2
    prize  seconds
0     5.5      840
1    14.5     1080
2    14.6     1380
3    21.0     1620
4    23.0     1650
5    24.0     1680
6    26.0     1700
7    38.0     1740
8    39.0     2040
9    40.0     2100
10   50.0     2160

For each value in seconds column from df1, I would like to get the match (or the closest to) row from df2 and also the closest 2 rows above and below the match.

The seconds columns contains only sorted unique values.

As result, I expect this:

Output: result
    prize  seconds
0     5.5      840
1    14.5     1080 # closest match to 1140
2    14.6     1380
3    21.0     1620
7    38.0     1740
8    39.0     2040
9    40.0     2100 # match 2100
10   50.0     2160

Esposito answered 27/2, 2024 at 12:15 Comment(3)

What if you have two matches for one target (e.g. if you have another 1080)? – Yesseniayester 27/2, 2024 at 12:18

Also, clarify if the seconds are sorted (if not, should the closest seconds or closest positions be selected?) – Yesseniayester 27/2, 2024 at 12:21

Hello @mozway. the seconds are always sorted and there is no chance of two matches, the seconds columns contains only unique values of seconds. Thank you. – Esposito 27/2, 2024 at 12:26

You can use a merge_asof to identify the closest value to each value in df1, then a rolling.max to extend the selection to the neighboring N rows:

N = 2 # number of surronding rows to keep

s1 = df1['seconds'].sort_values()
s2 = df2['seconds'].sort_values().rename('_')

keep = pd.merge_asof(s1, s2, left_on='seconds', right_on='_',
                     direction='nearest')['_']

out = df2[s2.isin(keep)
            .rolling(2*N+1, center=True, min_periods=1)
            .max().astype(bool)]

NB. if the seconds are already sorted, you can skip the .sort_values().

Output:

    prize  seconds
0     5.5      840
1    14.5     1080
2    14.6     1380
3    21.0     1620
7    38.0     1740
8    39.0     2040
9    40.0     2100
10   50.0     2160

Intermediates:

    prize  seconds  closest  isin(keep)  rolling.max
0     5.5      840      NaN       False         True
1    14.5     1080   1140.0        True         True
2    14.6     1380      NaN       False         True
3    21.0     1620      NaN       False         True
4    23.0     1650      NaN       False        False
5    24.0     1680      NaN       False        False
6    26.0     1700      NaN       False        False
7    38.0     1740      NaN       False         True
8    39.0     2040      NaN       False         True
9    40.0     2100   2100.0        True         True
10   50.0     2160      NaN       False         True

Yesseniayester answered 27/2, 2024 at 12:27 Comment(0)

A semi-vectorized approach, based on numpy broadcasting, can be the following. It first determines the indexes of the rows we want and then uses those indexes to retrieve the rows from df2:

n = 2
matches = np.abs(df2.values[:,1] - df1.values[:,1].reshape(-1, 1)).argmin(1)
lowest_idx = np.maximum(matches-n, 0)
largest_idx = np.minimum(matches+n+1, len(df2))
df2.iloc[np.concatenate(
    [np.arange(x, y) for x, y in zip(lowest_idx, largest_idx)]), :]

Output:

    prize  seconds
0     5.5      840
1    14.5     1080
2    14.6     1380
3    21.0     1620
7    38.0     1740
8    39.0     2040
9    40.0     2100
10   50.0     2160

Pelton answered 27/2, 2024 at 18:35 Comment(0)

import pandas as pd

data1 = {
    'score': [1, 2],
    'seconds': [1140, 2100],
}
data2 = {
    'prize': [5.5, 14.5, 14.6, 21, 23, 24, 26, 38, 39, 40, 50],
    'seconds': [840, 1080, 1380, 1620, 1650, 1680, 1700, 1740, 2040, 2100, 2160],
}

df1 = pd.DataFrame.from_dict(data1)
df2 = pd.DataFrame.from_dict(data2)

def find_closest_index(seconds):
    idx = df2['seconds'].sub(seconds).abs().idxmin()
    aa = max(0,idx -2)
    bb = min(len(df2),idx +3)
    return aa,bb

cc = [df2.iloc[start:end] for start,end in df1['seconds'].apply(find_closest_index)]
res = pd.concat(cc)
print(res)
'''
    prize    seconds
0     5.5      840
1    14.5     1080
2    14.6     1380
3    21.0     1620
7    38.0     1740
8    39.0     2040
9    40.0     2100
10   50.0     2160
'''

Kirst answered 15/8, 2024 at 13:28 Comment(0)

Recommended topics

Hot tags