I'm trying to evaluate a model based on its performance on historical sports betting.
I have a dataset that consists of the following columns:
feature1 | ... | featureX | oddsPlayerA | oddsPlayerB | winner
The model will be doing a regression where the output is the odds that playerA wins the match.
It is my understanding that I can use a custom scoring function to return the "money" the model would have made if it bet every time a condition is true, and use that value to measure the fitness of the model. The condition would be something like:
    if prediction_player_A_win_odds < oddsPlayerA:
        money += bet_playerA(oddsPlayerA, winner)
    if inverse_odd(prediction_player_A_win_odds) < oddsPlayerB:
        money += bet_playerB(oddsPlayerB, winner)
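To make the condition concrete, here is a minimal sketch of the calculation I have in mind. It rests on assumptions that are not fixed by my dataset: odds are decimal odds greater than 1, every bet is a unit stake, `winner` holds the labels "A" or "B", and `inverse_odd` ignores the bookmaker's margin. The helper names `bet_return` and `simulated_profit` are only illustrative.

```python
from typing import Sequence


def inverse_odd(decimal_odds: float) -> float:
    """Decimal odds of the opposite outcome (assumes no bookmaker margin, odds > 1)."""
    return decimal_odds / (decimal_odds - 1.0)


def bet_return(decimal_odds: float, won: bool, stake: float = 1.0) -> float:
    """Net profit of one bet: winnings minus the stake if won, otherwise the lost stake."""
    return stake * (decimal_odds - 1.0) if won else -stake


def simulated_profit(
    predicted_odds_a: Sequence[float],
    odds_a: Sequence[float],
    odds_b: Sequence[float],
    winner: Sequence[str],
) -> float:
    """Total profit from betting whenever the model's fair odds beat the offered odds."""
    money = 0.0
    for pred, offered_a, offered_b, won_by in zip(predicted_odds_a, odds_a, odds_b, winner):
        # Model thinks A is more likely than the bookmaker's price implies.
        if pred < offered_a:
            money += bet_return(offered_a, won_by == "A")
        # Same check for B, using the model's implied odds for B.
        if inverse_odd(pred) < offered_b:
            money += bet_return(offered_b, won_by == "B")
    return money
```

Note that all four sequences must be aligned row by row, which is exactly what I cannot guarantee inside the scorer.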
In the custom scoring function I need to receive the usual arguments like "ground_truth, predictions" (where ground_truth is the winner[] and predictions is prediction_player_A_win_odds[]) but also the fields "oddsPlayerA" and "oddsPlayerB" from the dataset (and here is the problem!).
If the custom scoring function were called with data in the exact same order as the original dataset, it would be trivial to retrieve the extra data from the dataset. But in reality, when using cross-validation, the scorer receives subsets of the rows in a different order from the original.
I've tried the most obvious approach, which was to pass the y variable as [oddsA, oddsB, winner] (dimensions [n, 3]), but scikit-learn didn't allow it.
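A toy illustration of the problem, using made-up numbers and a hand-picked `test_idx` that stands in for whatever rows a cross-validation fold happens to select:

```python
# Made-up data: one entry per row of the original dataset.
odds_a = [1.5, 2.1, 1.9, 3.0, 1.2, 2.6]
winner = ["A", "B", "A", "B", "A", "B"]

# Hypothetical row positions chosen for one test fold (order is not preserved).
test_idx = [4, 0, 3]

# The scorer is handed only the fold's ground truth, not test_idx.
y_true_fold = [winner[i] for i in test_idx]

# Looking the odds up by position in the original dataset gives the wrong rows...
naive_odds = odds_a[: len(y_true_fold)]

# ...while the correct odds need the fold's indices, which the scorer never sees.
correct_odds = [odds_a[i] for i in test_idx]
```

Here `naive_odds` and `correct_odds` differ, so any profit computed from the naive lookup would pair each prediction with the wrong match.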
So, how can I get data from the dataset into the custom scoring function that is neither X nor y but is still "tied together" in the same order?