Scikit-learn custom score function needs values from dataset other than X and y

I'm trying to evaluate a model based on its performance on historical sports betting.

I have a dataset that consists of the following columns:

feature1 | ... | featureX | oddsPlayerA | oddsPlayerB | winner

The model will do a regression whose output is the odds that player A wins the match.

It is my understanding that I can use a custom scoring function that returns the "money" the model would have made had it bet every time a certain condition was true, and use that value to measure the fitness of the model. The condition would be something like:

if prediction_player_A_win_odds < oddsPlayerA:
    money += bet_playerA(oddsPlayerA, winner)
if inverse_odd(prediction_player_A_win_odds) < oddsPlayerB:
    money += bet_playerB(oddsPlayerB, winner)
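
For concreteness, here is roughly what I mean by those helpers, assuming decimal odds and a fixed one-unit stake (bet_playerA, bet_playerB, and inverse_odd are just my placeholder names):

def inverse_odd(odd):
    # Decimal odds for one outcome imply probability p = 1/odd;
    # the fair odds for the opposite outcome are 1 / (1 - p).
    return odd / (odd - 1.0)

def bet_playerA(odds_a, winner, stake=1.0):
    # Profit of a fixed stake on player A at decimal odds odds_a.
    return stake * (odds_a - 1.0) if winner == 'A' else -stake

def bet_playerB(odds_b, winner, stake=1.0):
    # Profit of a fixed stake on player B at decimal odds odds_b.
    return stake * (odds_b - 1.0) if winner == 'B' else -stake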

The custom scoring function needs to receive the usual arguments, ground_truth and predictions (where ground_truth is winner[] and predictions is prediction_player_A_win_odds[]), but also the fields oddsPlayerA and oddsPlayerB from the dataset (and here is the problem!).
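
Written out, the scoring function I would like to use (with the helpers above) looks something like this; the trouble is that, as far as I can tell, sklearn only hands a scoring function ground_truth and predictions, not the two odds columns:

def betting_score(ground_truth, predictions, odds_a, odds_b):
    # ground_truth: winner[]; predictions: prediction_player_A_win_odds[];
    # odds_a, odds_b: the extra columns I cannot get sklearn to pass in.
    money = 0.0
    for pred, odd_a, odd_b, won in zip(predictions, odds_a, odds_b, ground_truth):
        if pred < odd_a:
            money += bet_playerA(odd_a, won)
        if inverse_odd(pred) < odd_b:
            money += bet_playerB(odd_b, won)
    return money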

If the custom scoring function were called with the data in the exact same order as the original dataset, retrieving this extra data would be trivial. But in reality, when cross-validation methods are used, the data it receives is all mixed up compared to the original.

I've tried the most obvious approach, which was to pass y as [oddsA, oddsB, winner] (shape [n, 3]), but scikit-learn didn't allow it.

So, how can I get data from the dataset into the custom scoring function that is neither X nor y but is still "tied together" in the same order?

Borzoi answered 3/11, 2014 at 0:54 Comment(6)
How are you doing the cross-validation? Scikit provides various iterators that will return indices into the original dataset that you can use to split train/test sets. If you have indices, you can use them to extract the aligned data you need.Rawdin
As far as I know, cross_validation.cross_val_score is the only way to pass a custom scoring function, so that is what I was using. I can look for a cross-validation method that returns indices, but how do I use a custom scoring function without cross_val_score?Borzoi
After thinking a little, I guess I could manually score it using predict and cross-validation with indices (or not? how would I train?). But I would prefer to be able to use tools like grid_search.GridSearchCV (which also allows a custom scoring function). If I could avoid the "manual" approach I would appreciate it.Borzoi
It's an interesting question. I don't see any obvious way to do it, but it seems like a reasonable thing to want to do. We'll see if any more knowledgeable scikit people respond.Rawdin
This message on the sklearn mailing list seems to be asking how to do a similar thing, and according to the response there is no direct way. You would have to roll your own version of GridSearchCV, or pass in all the data and use some sort of wrapper around the actual models that discards the data you don't want to influence the real fit (a rough sketch of that wrapper idea follows these comments).Rawdin
Thanks for the link. It doesn't look very promising for me. Now, yet another thing confuses me. Even if the GridSearchCV custom score allowed additional parameters, I don't think it would work for my use case, would it? My understanding is that it is only used to score the model after training. Or does GridSearchCV's internal fit() also use (somehow) this custom score function? Hope someone can help!Borzoi
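
A rough sketch of the wrapper idea from the comments above, assuming X carries oddsPlayerA and oddsPlayerB as its last two NumPy columns, and betting_score is the function sketched in the question (DropOddsRegressor and betting_scorer are hypothetical names):

from sklearn.base import BaseEstimator, RegressorMixin, clone

class DropOddsRegressor(BaseEstimator, RegressorMixin):
    """Fits the wrapped estimator on X minus its last two columns,
    so the odds can ride along without influencing the fit."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X[:, :-2], y)
        return self

    def predict(self, X):
        return self.estimator_.predict(X[:, :-2])

def betting_scorer(estimator, X, y):
    # GridSearchCV also accepts a scoring callable with the signature
    # scorer(estimator, X_test, y_test), so the odds arrive already
    # aligned with y as the last two columns of X.
    predictions = estimator.predict(X)
    return betting_score(y, predictions, X[:, -2], X[:, -1])

With that, GridSearchCV(DropOddsRegressor(model), param_grid, scoring=betting_scorer) should run, with the model's hyperparameters addressed as estimator__<name> in param_grid.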

There is no way to actually do this at the moment, sorry. You can write your own loop over the cross-validation folds, which should not be too hard. You cannot do this using GridSearchCV or cross_val_score.
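
A minimal sketch of such a loop, using the newer model_selection API and assuming NumPy arrays; betting_score is the function sketched in the question, and cv_betting_score is a made-up name:

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def cv_betting_score(estimator, X, y, odds_a, odds_b, n_splits=5):
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        # Refit a fresh copy of the estimator on each training fold.
        est = clone(estimator).fit(X[train_idx], y[train_idx])
        predictions = est.predict(X[test_idx])
        # Slicing the odds columns with the same test indices keeps
        # everything aligned with y and the predictions, no matter
        # how the folds split the rows.
        scores.append(betting_score(y[test_idx], predictions,
                                    odds_a[test_idx], odds_b[test_idx]))
    return float(np.mean(scores))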

Atalaya answered 3/11, 2014 at 18:35 Comment(4)
Are there any plans afoot to do this? It seems like a legit and useful feature. All you would have to do is have GridSearchCV pass the indices to the scorer as well, right?Rawdin
Yes. The scorer interface is pretty new and there are still some things that we haven't really worked out. You are right, this would be a useful feature.Atalaya
So it has been several years, but I have run into a very similar scenario and do not see a solution from searching. Before I copy and modify sklearn.model_selection._validation.cross_validate, I wanted to check if there are any updates on this, maybe a better way to do it? Thanks!Undesigning
You can probably do it much more easily than copying and modifying cross_validate, but no, there is no general solution. Feel free to open an issue on the sklearn issue tracker with your use case and we might consider it.Atalaya
