Random Forest Regressor using a custom objective/loss function (Python/Sklearn)

3

15

I want to build a Random Forest Regressor to model count data (Poisson distribution). The default 'mse' loss function is not suited to this problem. Is there a way to define a custom loss function and pass it to the random forest regressor in Python (sklearn or another library)?

Does any Python package provide an implementation for fitting count data?

Strobilaceous answered 26/3, 2018 at 14:0 Comment(3)
You would do better to post this question on Stack Exchange: stats.stackexchange.com - Exorcist
@GauravS: Why? I think this fits Stack Overflow perfectly, since it concerns a specific implementation in libraries rather than the concept itself. - Tooley
@MarcusV.: There might be a few folks who have already implemented it. Maybe they haven't contributed it to Python libraries, but they can share the code. - Exorcist
8

In sklearn this is currently not supported. See the discussion in the corresponding issue here, or this one for another class, where the reasons are discussed in a bit more detail (mainly the large computational overhead of calling a Python function).

It could still be done as discussed in those issues: fork sklearn, implement the cost function in Cython, and then add it to the list of available 'criterion' options.

Tooley answered 26/3, 2018 at 14:38 Comment(4)
Thanks for your answer. I am a complete beginner with Cython. Can you please point me to an example implementation or give more details? - Strobilaceous
Well, the sklearn devs reference some links here, and I used this tutorial as a starting point. - Tooley
@Prag1 How did you get on!? :\ - Atrophy
The maintainers of sklearn should support custom loss functions, even if the extra overhead of calling a Python function slows training down. I care more about being able to experiment flexibly. XGBoost can take a custom objective, and so can PyTorch; it feels archaic that sklearn can't. xgboost.readthedocs.io/en/latest/python/… I would try to implement it myself, but I doubt they'd merge my PR. The orthodoxy is that it's not a good idea, but orthodoxy shouldn't be taken seriously. - Stocky
0

If the problem is that the counts c_i arise from different exposure times t_i, then the counts indeed cannot be fitted directly. One can still fit the rates r_i = c_i/t_i with the MSE loss function, provided the samples are weighted in proportion to their exposures, w_i = t_i.
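
The weighting can be checked by hand for a single leaf. Under weighted MSE a leaf predicts the weighted mean of its targets, and with w_i = t_i that mean works out to total counts divided by total exposure, which is the maximum-likelihood rate for a leaf with one constant Poisson rate. A minimal sketch in plain Python; the numbers are made up and `weighted_mean` is a hypothetical helper, not a library function:

```python
def weighted_mean(values, weights):
    """Value m that minimises sum_i w_i * (v_i - m)**2."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Made-up data for illustration: counts c_i observed over exposures t_i.
counts = [3, 0, 7, 2]
exposures = [1.0, 0.5, 4.0, 2.5]

# Rates r_i = c_i / t_i are the regression targets.
rates = [c / t for c, t in zip(counts, exposures)]

# Leaf prediction under MSE with weights w_i = t_i.
leaf_prediction = weighted_mean(rates, exposures)

# Pooled rate: total counts over total exposure.
pooled_rate = sum(counts) / sum(exposures)

# Plain (unweighted) mean of the rates, for comparison.
unweighted = sum(rates) / len(rates)

print(leaf_prediction, pooled_rate, unweighted)
```

With sklearn, the same idea would presumably amount to passing the rates as `y` and the exposures as `sample_weight` to the regressor's `fit` method.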

As for a true Random Forest Poisson regression: in R, the rpart library builds a single CART tree and has a Poisson regression option. I wish this kind of algorithm had been ported to scikit-learn.
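
To make the idea of a Poisson split concrete, here is a rough sketch of choosing a threshold on one feature by minimising the summed Poisson deviance of the two children. It is not rpart's actual implementation: it assumes equal exposure for every sample, applies no shrinkage, and the names `poisson_deviance` and `best_split` are hypothetical.

```python
import math


def poisson_deviance(counts):
    """Poisson deviance of a node that predicts its mean count.

    D = 2 * sum_i [ y_i * log(y_i / mu) - (y_i - mu) ], with 0 * log(0) taken as 0.
    """
    mu = sum(counts) / len(counts)
    if mu == 0:
        return 0.0  # every count is zero, so the node fits perfectly
    total = 0.0
    for y in counts:
        if y > 0:
            total += y * math.log(y / mu)
        total -= y - mu
    return 2.0 * total


def best_split(x, y):
    """Return (threshold, deviance) minimising the summed child deviance.

    The threshold is None if no split improves on the unsplit node.
    """
    pairs = sorted(zip(x, y))
    best = (None, poisson_deviance([c for _, c in pairs]))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal feature values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        dev = poisson_deviance(left) + poisson_deviance(right)
        if dev < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, dev)
    return best


# Made-up data: low counts for small x, high counts for large x.
x = [1, 2, 3, 4, 5, 6]
y = [0, 1, 0, 9, 11, 10]
threshold, deviance = best_split(x, y)
print(threshold, deviance)
```

A full tree would apply this search recursively over all features, and a forest would repeat it on bootstrap samples with random feature subsets.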

Taddeo answered 18/7, 2018 at 8:16 Comment(0)
0

In R, writing a custom objective function is fairly simple.

The randomForestSRC package in R has a provision for writing your own custom split rule. The custom split rule, however, has to be written in pure C.

All you have to do is write your own custom split rule, register it, and then compile and install the package.

The custom split rule has to be defined in the file splitCustom.c in the randomForestSRC source code.

You can find more info here.

The file in which you define the split rule is this.

Pforzheim answered 6/8, 2018 at 12:16 Comment(0)
