Pandas: Resample dataframe column, get discrete feature that corresponds to max value
M

3

6

Sample data:

import pandas as pd
import numpy as np
import datetime

data = {'value': [1,2,4,3], 'names': ['joe', 'bob', 'joe', 'bob']}
start, end = datetime.datetime(2015, 1, 1), datetime.datetime(2015, 1, 4)
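# pd.date_range replaces DatetimeIndex(start=..., end=..., freq=...), which pandas 1.0+ no longer accepts
test = pd.DataFrame(data=data, index=pd.date_range(start=start, end=end, freq="D"),
                    columns=["value", "names"])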

gives:

            value names
2015-01-01      1   joe
2015-01-02      2   bob
2015-01-03      4   joe
2015-01-04      3   bob

I want to resample by '2D' and, for each bin, get the max value together with the name on that row, something like:

test.resample('2D')

The expected result should be:

            value names
2015-01-01      2   bob
2015-01-03      4   joe

Can anyone help me?

Micrometry answered 27/6, 2017 at 20:56 Comment(1)
I've updated my answer if you're interested. - Landa
K
6

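You can resample to get the index label (the arg max) of the largest value in each bin, then use those labels to look up the matching names and value. Here df is the question's test frame: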

(df.resample('2D')[['value']].idxmax()
   .assign(names=lambda x: df.loc[x.value]['names'].values,
           value=lambda x: df.loc[x.value]['value'].values)
)
Out[116]: 
            value names
2015-01-01      2   bob
2015-01-03      4   joe
Katrinakatrine answered 27/6, 2017 at 21:3 Comment(1)
Super solution. This also extends to data with the same dates. - Micrometry
L
4

Use apply and return, for each bin, the row with the maximal value. The resample supplies the bin label as the index:

test.resample('2D').apply(lambda df: df.loc[df.value.idxmax()])

            value names
2015-01-01      2   bob
2015-01-03      4   joe
Landa answered 27/6, 2017 at 21:9 Comment(4)
As I said to ayhan :-), this doesn't give the index that the OP is expecting. There might be a slick way to do it in one line, but I think you could just name the idxmax() result something and then set_index(ii.index) to patch it. - Virgule
@Virgule I did the same thing inside an apply. This way the indices are handled by the resample but I get the rows I want. Thanks for letting me know, I was in a meeting and couldn't respond right away (-: - Landa
This raises AttributeError: 'Series' object has no attribute 'value' on pandas v1.1.2. - Suzette
value in this context was a column designated by the OP. It could have been written as test.resample('2D').apply(lambda df: df.loc[df['value'].idxmax()]) to make it clearer. - Landa
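The index patch Virgule describes in the comments can be sketched roughly as follows. This is only an illustration, not either answerer's code: it rebuilds the question's test frame with pd.date_range, uses groupby with pd.Grouper(freq="2D") in place of resample('2D') for the binning, and the names idx and result are made up for the example.

```python
import pandas as pd

# Rebuild the question's sample frame
test = pd.DataFrame(
    {"value": [1, 2, 4, 3], "names": ["joe", "bob", "joe", "bob"]},
    index=pd.date_range("2015-01-01", "2015-01-04", freq="D"),
)

# For each 2-day bin, the index label of the row holding the largest value
# (pd.Grouper is swapped in for resample here; the bins are the same)
idx = test.groupby(pd.Grouper(freq="2D"))["value"].idxmax()

# Pull those rows, then relabel them with the bin starts
result = test.loc[idx.values]
result.index = idx.index
print(result)
```

Because the rows are looked up by label, this assumes the original index has no duplicate timestamps, and like any idxmax-based approach it assumes every bin contains at least one row.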
J
0

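idxmax works well unless some dates are missing. For example, if you resample daily and one day has no rows, idxmax raises an error instead of returning NaN.

The following function works around the problem. It takes a frame that has already been resampled on key_col (for example with max()) and looks up the other columns in the original frame. Note that it matches rows on the value of key_col alone and takes the first match in the whole frame, so it assumes that value identifies the row you want.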

def map_resample_columns(original_df, resample_df, key_col, cols):
    """
    Add columns from original_df back onto resample_df.

    resample_df: frame resampled from original_df based on key_col
    cols: list of columns from original_df to be added back to resample_df
    Requires numpy imported as np. Modifies resample_df in place and returns it.
    """
    for col in cols:
        record_info = []
        for idx, row in resample_df.iterrows():
            val = row[key_col]
            if not np.isnan(val):
                record_info.append(original_df[original_df[key_col] == val][col].tolist()[0])
            else:
                record_info.append(np.nan)
        resample_df[col] = record_info
    return resample_df
Job answered 17/5, 2022 at 20:1 Comment(0)
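Another way to sidestep empty bins, sketched here under stated assumptions rather than taken from the answer above: group only on the days that actually have rows, so idxmax never sees an empty group, then reindex to the full daily range to get NaN for the missing days. The sample data (with no rows on 2015-01-02) and the names day, idx, result and full are hypothetical, and the sketch assumes daily bins aligned to midnight so that floor("D") matches what resample("D") would produce.

```python
import pandas as pd

# Hypothetical data with a gap: there are no rows on 2015-01-02
df = pd.DataFrame(
    {"value": [1, 2, 4, 3], "names": ["joe", "bob", "joe", "bob"]},
    index=pd.to_datetime(
        ["2015-01-01 00:00", "2015-01-01 12:00",
         "2015-01-03 00:00", "2015-01-03 12:00"]
    ),
)

# Group on the calendar day of each row; only days with data form groups
day = df.index.floor("D")
idx = df.groupby(day)["value"].idxmax()

# Pull the winning rows and label them by day
result = df.loc[idx.values]
result.index = idx.index

# Reindex to every day in the range so missing days show up as NaN
full = pd.date_range(day.min(), day.max(), freq="D")
result = result.reindex(full)
print(result)
```

Because the lookup is by index label rather than by value, repeated values in different bins do not confuse it, but it does assume the original index has no duplicate timestamps.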
