Pandas: apply a custom WMAPE aggregation function to multiple columns without a for-loop?

Objective: group pandas dataframe using a custom WMAPE (Weighted Mean Absolute Percent Error) function on multiple forecast columns and one actual data column, without for-loop. I know a for-loop & merges of output dataframes will do the trick. I want to do this efficiently.

Have: WMAPE function, successful use of WMAPE function on one forecast column of dataframe. One column of actual data, variable number of forecast columns.

Input Data: Pandas DataFrame with several categorical columns (City, Person, DT, HOUR), one actual data column (Actual), and four forecast columns (Forecast_1 ... Forecast_4). See link for csv: https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1

Need: WMAPE function applied during groupby on multiple columns with a list of forecast columns fed into groupby line.

Output Desired: An output dataframe with categorical groups columns and all columns of WMAPE. Labeling is preferred but not needed (output image below).

Successful Code so far: Two WMAPE functions: one to take two series in & output a single float value (wmape), and one structured for use in a groupby (wmape_gr):

import pandas as pd

def wmape(actual, forecast):
    # take two series (actual and forecast) and return their wmape as a float

    # make a series called mape
    se_mape = abs(actual-forecast)/actual

    # get a float of the sum of the actual
    ft_actual_sum = actual.sum()

    # get a series of the multiple of the actual & the mape
    se_actual_prod_mape = actual * se_mape

    # summate the prod of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()

    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum

    # return a float
    return ft_wmape_forecast

def wmape_gr(df_in, st_actual, st_forecast):
    # take a dataframe plus the names of its actual and forecast columns,
    # and return the wmape as a float

    # make a series called mape
    se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]

    # get a float of the sum of the actual
    ft_actual_sum = df_in[st_actual].sum()

    # get a series of the multiple of the actual & the mape
    se_actual_prod_mape = df_in[st_actual] * se_mape

    # summate the prod of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()

    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum

    # return a float
    return ft_wmape_forecast

# read in data directly from Dropbox
df = pd.read_csv('https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1',sep=",",header=0)

# grouping with 3 columns. wmape_gr uses the Actual column, and Forecast_1 as inputs
df_gr = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')

The output is a Series with one Forecast_1 WMAPE value per (City, Person, DT) group.

Desired output would have all forecasts in one shot (dummy data for Forecast_2 ... Forecast_4). I can already do this with a for-loop. I just want to do it within the groupby. I want to call a wmape function four times. I would appreciate any assistance.

Corvin answered 22/2, 2019 at 16:29 Comment(0)

If you modify wmape to work with arrays using broadcasting, then you can do it in one shot:

import numpy as np

def wmape(actual, forecast):
    # Take a series (actual) and a dataframe (forecast) and calculate wmape
    # for each forecast. Returns a dict with one entry per forecast column

    # Convert to numpy arrays for broadcasting
    forecast = np.array(forecast.values)
    actual = np.array(actual.values).reshape((-1, 1))

    # Make an array of mape (same shape as forecast)
    se_mape = abs(actual-forecast)/actual

    # Calculate sum of actual values
    ft_actual_sum = actual.sum(axis=0)

    # Multiply the actual values by the mape
    se_actual_prod_mape = actual * se_mape

    # Take the sum of the product of actual values and mape
    # Make sure to sum down the rows (1 for each column)
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum(axis=0)

    # Calculate the wmape for each forecast and return as a dictionary
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    return {f'Forecast_{i+1}_wmape': wmape for i, wmape in enumerate(ft_wmape_forecast)}
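
To see why `actual` is reshaped into a column before the subtraction, here is a minimal NumPy sketch with made-up numbers (the arrays are illustrative and not taken from the question's CSV). An array of shape (n, 1) broadcasts against an array of shape (n, k), so every forecast column is compared with the same actual values:

```python
import numpy as np

# Illustrative values only: 3 rows, 2 forecast columns
actual = np.array([10.0, 20.0, 30.0]).reshape((-1, 1))   # shape (3, 1)
forecast = np.array([[12.0, 9.0],
                     [18.0, 25.0],
                     [33.0, 30.0]])                       # shape (3, 2)

# The (3, 1) column is stretched across both forecast columns
abs_err = np.abs(actual - forecast)                       # shape (3, 2)

# Same steps as the function above: weight the mape by actual, sum down the rows
wmape_per_forecast = (actual * (abs_err / actual)).sum(axis=0) / actual.sum(axis=0)
```

Without the reshape, a 1-D `actual` of length n would be matched against the last axis of `forecast` (length k), which either fails or silently pairs the wrong values.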

Then use apply on the proper columns:

# Group the dataframe and apply the function to appropriate columns
new_df = df.groupby(['City', 'Person', 'DT']).apply(lambda x: wmape(x['Actual'], 
                                        x[[c for c in x if 'Forecast' in c]])).\
            to_frame().reset_index()

This results in a dataframe with a single column of dictionaries.

The single column can be converted to multiple columns for the correct format:

# Convert the dictionary in a single column into 4 columns with proper names
# and concatenate column-wise
df_grp = pd.concat([new_df.drop(columns=[0]), 
                    pd.DataFrame(list(new_df[0].values))], axis=1)

Result: one row per (City, Person, DT) group, with columns Forecast_1_wmape through Forecast_4_wmape.
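
A possible variation, sketched under assumptions rather than taken from the answer: if the function returns a `pd.Series` instead of a dict, `groupby.apply` assembles the columns itself and the dictionary-unpacking step is not needed. The helper name `wmape_series` and the toy data below are hypothetical.

```python
import numpy as np
import pandas as pd

def wmape_series(group, actual_col, forecast_cols):
    # Hypothetical helper: one labelled WMAPE value per forecast column
    actual = group[actual_col].to_numpy(dtype=float).reshape(-1, 1)
    forecast = group[forecast_cols].to_numpy(dtype=float)
    weighted = actual * (np.abs(actual - forecast) / actual)
    return pd.Series(weighted.sum(axis=0) / actual.sum(),
                     index=[f"{c}_wmape" for c in forecast_cols])

# Toy data standing in for the question's CSV (values are made up)
toy = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "Actual": [10.0, 30.0, 20.0, 20.0],
    "Forecast_1": [12.0, 27.0, 20.0, 10.0],
    "Forecast_2": [10.0, 20.0, 30.0, 30.0],
})
forecast_cols = [c for c in toy.columns if c.startswith("Forecast_")]

# Select only the needed columns so the grouping key is not passed to the helper
result = (
    toy.groupby("City")[["Actual"] + forecast_cols]
       .apply(wmape_series, "Actual", forecast_cols)
       .reset_index()
)
```

Because every group returns a Series with the same index, the result is a DataFrame with one column per forecast, and the column names come from the forecast column names rather than from their position.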

Prehistoric answered 22/2, 2019 at 18:36 Comment(5)
Hi @willk, the solution you provided works flawlessly in Spyder (Python 3.6). However, when I run it in a Jupyter notebook, I get a strange error on the return line of the function you rewrote. Any thoughts? File "<ipython-input-4-3a2661b6495f>", line 88 return {f'Forecast_{i+1}_wmape': wmape for i, wmape in enumerate(ft_wmape_forecast)} ^ SyntaxError: invalid syntax - Corvin
What version of Python are you running in your notebook? It sounds like the Python in your notebook might not support f-string formatting (available as of Python 3.6). - Prehistoric
If you are using an older version of Python, then you'll have to use string formatting like: return {'Forecast_%d_wmape' % (i + 1): wmape for i, wmape in enumerate(ft_wmape_forecast)} - Prehistoric
It appears to be a version mismatch between my two Python environments, as you suggested. Sorry about that, and thank you for the response. - Corvin
No problem, it's good to be aware of which version of Python and which versions of external libraries you are using. Errors can often be traced to a version mismatch, something with which I'm all too familiar! - Prehistoric
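
A quick way to check for the mismatch discussed in these comments is to print the interpreter version from inside each environment. The sketch below also builds the labels with `str.format`, which works on Python versions that predate f-strings; the count of four forecasts is assumed from the question.

```python
import sys

# Show which interpreter this environment (Spyder, a Jupyter kernel, ...) runs
print(sys.version_info[:3])

# str.format is available without f-string support; i + 1 keeps labels 1-based
labels = ["Forecast_{}_wmape".format(i + 1) for i in range(4)]
```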

This is a really good problem to show how to optimize a groupby.apply in pandas. There are two principles that I use to help with these problems.

  1. Any calculation that is independent of the group should not be done within a groupby
  2. If there is a built-in groupby method, use it first before using apply

Let's go line by line through your wmape_gr function.

se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]

This line is completely independent of any group. You should do this calculation outside of the apply. Below I do this for each of the forecast columns:

df['actual_forecast_diff_1'] = (df['Actual'] - df['Forecast_1']).abs() / df['Actual']
df['actual_forecast_diff_2'] = (df['Actual'] - df['Forecast_2']).abs() / df['Actual']
df['actual_forecast_diff_3'] = (df['Actual'] - df['Forecast_3']).abs() / df['Actual']
df['actual_forecast_diff_4'] = (df['Actual'] - df['Forecast_4']).abs() / df['Actual']

Let's take a look at the next line:

ft_actual_sum = df_in[st_actual].sum()

This line is dependent on the group so we must use a groupby here, but it isn't necessary to place this within the apply function. It will be calculated later on below.

Let's move to the next line:

se_actual_prod_mape = df_in[st_actual] * se_mape

This again is independent of the group. Let's calculate it on the DataFrame as a whole.

df['forecast1_wampe'] = df['actual_forecast_diff_1'] *  df['Actual']
df['forecast2_wampe'] = df['actual_forecast_diff_2'] *  df['Actual']
df['forecast3_wampe'] = df['actual_forecast_diff_3'] *  df['Actual']
df['forecast4_wampe'] = df['actual_forecast_diff_4'] *  df['Actual']
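
As an aside, the division in the first step and the multiplication here cancel whenever every Actual value is nonzero, so the per-row quantity is simply the absolute error. A short derivation, writing $A_i$ for Actual and $F_i$ for one forecast within a group (this notation is introduced here and does not appear in the question):

```latex
\mathrm{WMAPE}
  = \frac{\sum_i A_i \cdot \dfrac{|A_i - F_i|}{A_i}}{\sum_i A_i}
  = \frac{\sum_i |A_i - F_i|}{\sum_i A_i},
  \qquad A_i \neq 0 \ \text{for all } i .
```

If some $A_i$ is zero, the two forms can differ: the original expression yields NaN for that row, while the simplified numerator counts $|F_i|$.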

Let's move on to the last two lines:

ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum

These lines again are dependent on the group, but we still don't need to use apply. We now have each of the 4 'forecast_wampe' columns calculated independently of the group. We simply need to sum each one per group. The same goes for the 'Actual' column.

We can run two separate groupby operations to sum each of these columns like this:

g = df.groupby(['City', 'Person', 'DT'])
actual_sum = g['Actual'].sum()
forecast_wampe_cols = ['forecast1_wampe', 'forecast2_wampe', 'forecast3_wampe', 'forecast4_wampe']
forecast1_wampe_sum = g[forecast_wampe_cols].sum()

This returns a Series (actual_sum) and a DataFrame (forecast1_wampe_sum), both indexed by City, Person and DT.

Then we just need to divide each of the columns in the DataFrame by the Series. We use the div method with axis='index' so that the division aligns on the row index rather than on the column labels.

forecast1_wampe_sum.div(actual_sum, axis='index')

And this returns our answer: a DataFrame of per-group WMAPE values, one column per forecast.
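
The steps above can be condensed into a sketch that handles any number of forecast columns. It assumes the forecast columns share the `Forecast_` prefix and that Actual is never zero, in which case Actual times the MAPE reduces to the absolute error; the toy data and the names `abs_err`, `abs_err_sum` and `wmape_df` are illustrative only.

```python
import pandas as pd

# Toy data standing in for the question's CSV (values are made up)
toy = pd.DataFrame({
    "City":   ["A", "A", "B", "B"],
    "Person": ["x", "x", "y", "y"],
    "Actual": [10.0, 30.0, 20.0, 20.0],
    "Forecast_1": [12.0, 27.0, 20.0, 10.0],
    "Forecast_2": [10.0, 20.0, 30.0, 30.0],
})
group_keys = ["City", "Person"]
forecast_cols = [c for c in toy.columns if c.startswith("Forecast_")]

# Row-wise absolute errors for all forecast columns at once (no groupby needed)
abs_err = toy[forecast_cols].sub(toy["Actual"], axis=0).abs()

# Two built-in groupby sums, then an index-aligned division
abs_err_sum = abs_err.groupby([toy[k] for k in group_keys]).sum()
actual_sum = toy.groupby(group_keys)["Actual"].sum()
wmape_df = (
    abs_err_sum.div(actual_sum, axis="index")
               .add_suffix("_wmape")
               .reset_index()
)
```

Only built-in `sum` aggregations touch the groups here, which is the point of the two principles at the top of this answer.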

Snub answered 25/2, 2019 at 17:24 Comment(0)

Without changing the functions, apply wmape_gr four times:

df_gr1 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')
df_gr2 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_2')
df_gr3 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_3')
df_gr4 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_4')

Join them together:

all1= pd.concat([df_gr1, df_gr2,df_gr3,df_gr4],axis=1, sort=False)

Get the columns for City, Person and DT:

all1['city']= [all1.index[i][0]  for i in range(len(df_gr1))]
all1['Person']= [all1.index[i][1]  for i in range(len(df_gr1))]
all1['DT']= [all1.index[i][2]  for i in range(len(df_gr1))]

Rename the columns and change the order:

df = all1.rename(columns={0:'Forecast_1_wmape', 1:'Forecast_2_wmape',2:'Forecast_3_wmape',3:'Forecast_4_wmape'})

df = df[['city','Person','DT','Forecast_1_wmape','Forecast_2_wmape','Forecast_3_wmape','Forecast_4_wmape']]

df=df.reset_index(drop=True)
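
A more compact variant of the same idea, offered as a sketch rather than as part of the answer: passing a dict of Series to `pd.concat` names the columns directly, and `reset_index` restores the grouping columns, so the manual rename and index extraction are not needed. The toy data is made up, and `wmape_gr` is condensed from the question.

```python
import pandas as pd

def wmape_gr(df_in, st_actual, st_forecast):
    # Same calculation as the question's wmape_gr, condensed
    se_mape = (df_in[st_actual] - df_in[st_forecast]).abs() / df_in[st_actual]
    return (df_in[st_actual] * se_mape).sum() / df_in[st_actual].sum()

# Toy data standing in for the question's CSV (values are made up)
toy = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "Actual": [10.0, 30.0, 20.0, 20.0],
    "Forecast_1": [12.0, 27.0, 20.0, 10.0],
    "Forecast_2": [10.0, 20.0, 30.0, 30.0],
})
forecast_cols = ["Forecast_1", "Forecast_2"]
g = toy.groupby(["City"])

# One apply per forecast column; the dict keys become the output column names
wmape_all = pd.concat(
    {f"{c}_wmape": g[["Actual", c]].apply(wmape_gr, "Actual", c)
     for c in forecast_cols},
    axis=1,
).reset_index()
```

This still calls apply once per forecast column (the comprehension is a loop in disguise), so it tidies the code rather than speeding it up.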
Oldest answered 1/3, 2019 at 2:48 Comment(0)
