Given a value from a pandas DataFrame column, select the N rows above and below the closest value in another DataFrame

3

6

I have two pandas DataFrames:

import pandas as pd

data1 = {
    'score': [1, 2],
    'seconds': [1140, 2100],
}

data2 = {
    'prize': [5.5, 14.5, 14.6, 21, 23, 24, 26, 38, 39, 40, 50],
    'seconds': [840, 1080, 1380, 1620, 1650, 1680, 1700, 1740, 2040, 2100, 2160],
}

df1 = pd.DataFrame.from_dict(data1)
df2 = pd.DataFrame.from_dict(data2)

Output: df1
   score  seconds
0      1     1140
1      2     2100

Output: df2
    prize  seconds
0     5.5      840
1    14.5     1080
2    14.6     1380
3    21.0     1620
4    23.0     1650
5    24.0     1680
6    26.0     1700
7    38.0     1740
8    39.0     2040
9    40.0     2100
10   50.0     2160

For each value in the seconds column of df1, I would like to get the matching row (or the closest one) from df2, plus the 2 rows above and the 2 rows below that match.

The seconds columns contain only sorted, unique values.

As a result, I expect this:

Output: result
    prize  seconds
0     5.5      840
1    14.5     1080 # closest match to 1140
2    14.6     1380
3    21.0     1620
7    38.0     1740
8    39.0     2040
9    40.0     2100 # match 2100
10   50.0     2160
Esposito answered 27/2 at 12:15 Comment(3)
What if you have two matches for one target (e.g. if you have another 1080)? – Yesseniayester
Also, please clarify whether the seconds are sorted (if not, should the closest seconds or the closest positions be selected?) – Yesseniayester
Hello @mozway. The seconds are always sorted and there is no chance of two matches: the seconds columns contain only unique values. Thank you. – Esposito
8

You can use a merge_asof to identify the closest value to each value in df1, then a rolling.max to extend the selection to the neighboring N rows:

N = 2  # number of surrounding rows to keep on each side

s1 = df1['seconds'].sort_values()
s2 = df2['seconds'].sort_values().rename('_')

keep = pd.merge_asof(s1, s2, left_on='seconds', right_on='_',
                     direction='nearest')['_']

out = df2[s2.isin(keep)
            .rolling(2*N+1, center=True, min_periods=1)
            .max().astype(bool)]

NB: if the seconds are already sorted, you can skip the .sort_values() calls.

Output:

    prize  seconds
0     5.5      840
1    14.5     1080
2    14.6     1380
3    21.0     1620
7    38.0     1740
8    39.0     2040
9    40.0     2100
10   50.0     2160

Intermediates:

    prize  seconds  closest  isin(keep)  rolling.max
0     5.5      840      NaN       False         True
1    14.5     1080   1140.0        True         True
2    14.6     1380      NaN       False         True
3    21.0     1620      NaN       False         True
4    23.0     1650      NaN       False        False
5    24.0     1680      NaN       False        False
6    26.0     1700      NaN       False        False
7    38.0     1740      NaN       False         True
8    39.0     2040      NaN       False         True
9    40.0     2100   2100.0        True         True
10   50.0     2160      NaN       False         True
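
The intermediate columns above can be rebuilt with a short, self-contained sketch. It assumes the sample data from the question and that both seconds columns are already sorted; the names nearest, is_match, in_window and steps are made up for illustration and are not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'score': [1, 2], 'seconds': [1140, 2100]})
df2 = pd.DataFrame({
    'prize': [5.5, 14.5, 14.6, 21, 23, 24, 26, 38, 39, 40, 50],
    'seconds': [840, 1080, 1380, 1620, 1650, 1680, 1700, 1740, 2040, 2100, 2160],
})

N = 2  # rows to keep on each side of a match

# Nearest df2 'seconds' for each df1 'seconds' (both assumed sorted).
nearest = pd.merge_asof(
    df1[['seconds']],
    df2[['seconds']].rename(columns={'seconds': 'nearest'}),
    left_on='seconds', right_on='nearest', direction='nearest',
)

# Flag the matched rows of df2, then widen each flag by N rows on both sides.
is_match = df2['seconds'].isin(nearest['nearest'])
in_window = (is_match.astype(int)
             .rolling(2 * N + 1, center=True, min_periods=1)
             .max().astype(bool))

# Side-by-side view of each step; steps[steps['in_window']] is the selection.
steps = df2.assign(is_match=is_match, in_window=in_window)
print(steps)
```

The centered window of width 2*N+1 is what turns a single matched row into the match plus N neighbors on each side, and min_periods=1 keeps the windows at the edges of df2 from producing NaN.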
Yesseniayester answered 27/2 at 12:27 Comment(0)
0

A semi-vectorized approach based on NumPy broadcasting: first find the positions of the rows we want, then use those positions to retrieve the rows from df2:

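import numpy as np

n = 2  # rows to keep on each side of the match
# position in df2 of the closest 'seconds' for each df1 row (2-D broadcast)
matches = np.abs(df2['seconds'].to_numpy()
                 - df1['seconds'].to_numpy().reshape(-1, 1)).argmin(1)
lowest_idx = np.maximum(matches-n, 0)
largest_idx = np.minimum(matches+n+1, len(df2))
df2.iloc[np.concatenate(
    [np.arange(x, y) for x, y in zip(lowest_idx, largest_idx)]), :]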

Output:

    prize  seconds
0     5.5      840
1    14.5     1080
2    14.6     1380
3    21.0     1620
7    38.0     1740
8    39.0     2040
9    40.0     2100
10   50.0     2160
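
To make the broadcasting step easier to follow, here is a minimal sketch on plain arrays holding the question's sample values. The names targets, seconds, dist and positions are illustrative. The np.unique call is an addition that is not in the answer above: it assumes you would want repeated rows dropped if two windows ever overlap.

```python
import numpy as np

targets = np.array([1140, 2100])                      # df1['seconds']
seconds = np.array([840, 1080, 1380, 1620, 1650, 1680,
                    1700, 1740, 2040, 2100, 2160])    # df2['seconds']

# A (2, 1) column against an (11,) row broadcasts to a (2, 11) matrix:
# dist[i, j] is the absolute distance between target i and df2 row j.
dist = np.abs(seconds - targets.reshape(-1, 1))
matches = dist.argmin(axis=1)        # closest df2 position per target

n = 2
start = np.maximum(matches - n, 0)                 # clip at the first row
stop = np.minimum(matches + n + 1, len(seconds))   # clip at the last row

# Assumption: overlapping windows should not repeat rows, so deduplicate.
positions = np.unique(
    np.concatenate([np.arange(a, b) for a, b in zip(start, stop)]))
print(positions)
```

With a DataFrame at hand, df2.iloc[positions] would then return the selected rows. The distance matrix has one row per target and one column per df2 row, so memory grows with the product of the two lengths.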
Pelton answered 27/2 at 18:35 Comment(0)
0
import pandas as pd

data1 = {
    'score': [1, 2],
    'seconds': [1140, 2100],
}
data2 = {
    'prize': [5.5, 14.5, 14.6, 21, 23, 24, 26, 38, 39, 40, 50],
    'seconds': [840, 1080, 1380, 1620, 1650, 1680, 1700, 1740, 2040, 2100, 2160],
}

df1 = pd.DataFrame.from_dict(data1)
df2 = pd.DataFrame.from_dict(data2)

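def find_closest_index(seconds, n=2):
    # idxmin returns an index label; it equals the position here because
    # df2 has the default RangeIndex
    idx = df2['seconds'].sub(seconds).abs().idxmin()
    aa = max(0, idx - n)             # first row of the window
    bb = min(len(df2), idx + n + 1)  # one past the last row
    return aa, bb

# one slice of df2 per target value in df1
cc = [df2.iloc[start:end] for start, end in df1['seconds'].apply(find_closest_index)]
res = pd.concat(cc)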
print(res)
'''
    prize  seconds
0     5.5      840
1    14.5     1080
2    14.6     1380
3    21.0     1620
7    38.0     1740
8    39.0     2040
9    40.0     2100
10   50.0     2160
'''
Kirst answered 15/8 at 13:28 Comment(0)
