Random Forest Regressor using a custom objective/loss function (Python/Sklearn)

Asked 26/3, 2018 at 14:0 Answered 6/8, 2018 at 12:16

python-3.x scikit-learn random-forest statsmodels poisson

I want to build a Random Forest Regressor to model count data (Poisson distribution). The default 'mse' loss function is not suited to this problem. Is there a way to define a custom loss function and pass it to the random forest regressor in Python (Sklearn, etc.)?

Is there any Python package that already implements fitting count data this way?

Strobilaceous answered 26/3, 2018 at 14:0 Comment(3)

Better post your question on Stack Exchange stats.stackexchange.com – Exorcist 26/3, 2018 at 14:4

@GauravS: why? I think this fits perfectly to stackoverflow as it regards specific implementation in libraries and not the concept itself. – Tooley 26/3, 2018 at 14:38

@MarcusV.: There might be a few folks who have already implemented it. Maybe they haven't contributed it to Python libraries, but can share the code. – Exorcist 27/3, 2018 at 3:1

In sklearn this is currently not supported. See the discussion in the corresponding issue here, or this for another class, where the reasons are discussed in a bit more detail (mainly the large computational overhead of calling a Python function).

As discussed in those issues, it could be done by forking sklearn, implementing the cost function in Cython, and then adding it to the list of available 'criterion' values.

Tooley answered 26/3, 2018 at 14:38 Comment(4)

Thanks for your answer. I am a complete beginner to Cython. Can you please point me to an example implementation or give more details? – Strobilaceous 26/3, 2018 at 15:26

Well, the sklearn devs reference some links here and I have used this tutorial as a starter. – Tooley 26/3, 2018 at 15:51

@Prag1 How did you get on!? :\ – Atrophy 19/11, 2018 at 9:39

The maintainers of sklearn should support custom loss functions, even if there's extra overhead from calling a Python function that slows training down. I care more about being able to experiment flexibly. XGBoost can take a custom objective, and so can PyTorch; it feels archaic that sklearn can't. xgboost.readthedocs.io/en/latest/python/… I would try to implement it myself, but I doubt they'd merge my PR. The orthodoxy is that it's not a good idea, but orthodoxy shouldn't be taken seriously. – Stocky 12/7, 2021 at 18:18

If the problem is that the counts c_i arise from different exposure times t_i, then indeed one cannot fit the counts directly, but one can still fit the rates r_i = c_i/t_i using the MSE loss function, provided the samples are weighted in proportion to their exposures, w_i = t_i.
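A small check of why exposure weights are a sensible choice, assuming a single leaf that predicts one constant rate: the minimiser of the exposure-weighted squared error is the weighted mean of the rates, which equals total counts divided by total exposure, i.e. the Poisson maximum-likelihood rate. The data and the helper name weighted_mean are made up for illustration.

```python
# Hypothetical data: counts c_i observed over exposure times t_i.
counts = [3, 0, 7, 2]
exposures = [1.0, 0.5, 2.0, 1.5]

# Rates r_i = c_i / t_i are the regression targets.
rates = [c / t for c, t in zip(counts, exposures)]


def weighted_mean(values, weights):
    """Minimiser of sum_i w_i * (v_i - m)**2 over m."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Leaf prediction under exposure-weighted MSE (w_i = t_i).
leaf_rate = weighted_mean(rates, exposures)

# Poisson maximum-likelihood estimate of a constant rate.
poisson_mle = sum(counts) / sum(exposures)

print(leaf_rate, poisson_mle)
```

In scikit-learn, this would correspond to passing the rates as the target and sample_weight=exposures to the regressor's fit method. Only the leaf values coincide with the Poisson estimate; the splits are still chosen by squared error, so the resulting trees need not match those of a Poisson-deviance criterion.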

For a true Random Forest Poisson regression, I've seen that in R there is the rpart library for building a single CART tree, which has a Poisson regression option. I wish this kind of algorithm had been ported to scikit-learn.
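To make the idea of a Poisson split concrete, here is a minimal pure-Python sketch that scores candidate splits on one numeric feature by Poisson deviance instead of squared error. It assumes unit exposure for every sample, and the function names poisson_deviance and best_split are hypothetical; it is not rpart's or scikit-learn's implementation.

```python
import math


def poisson_deviance(counts):
    """Poisson deviance of a node that predicts the mean count."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0  # all counts are zero, the fit is exact
    # y * log(y / mean) is taken as 0 when y == 0; the (y - mean) terms
    # sum to zero because the prediction is the node mean.
    return 2.0 * sum(y * math.log(y / mean) for y in counts if y > 0)


def best_split(x, y):
    """Return (threshold, deviance) of the best split on one feature."""
    pairs = sorted(zip(x, y))
    best = (None, poisson_deviance(y))  # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal feature values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        dev = poisson_deviance(left) + poisson_deviance(right)
        if dev < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, dev)
    return best


# Hypothetical toy data: low counts for x < 3, high counts for x >= 3.
x = [0, 1, 2, 3, 4, 5]
y = [1, 0, 2, 9, 11, 10]
threshold, deviance = best_split(x, y)
print(threshold, deviance)
```

A forest would repeat this search over bootstrap samples and random feature subsets. Later scikit-learn releases added a built-in criterion="poisson" option to DecisionTreeRegressor and RandomForestRegressor, which covers the Poisson case without a custom criterion.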

Taddeo answered 18/7, 2018 at 8:16 Comment(0)

In R, writing a custom objective function is fairly simple.

The randomForestSRC package in R lets you write your own custom split rule. The custom split rule, however, has to be written in pure C.

All you have to do is write your own custom split rule, register it, and then compile and install the package.

The custom split rule has to be defined in the file called splitCustom.c in the randomForestSRC source code.

You can find more info here.

The file in which you define the split rule is this.

Pforzheim answered 6/8, 2018 at 12:16 Comment(0)
