how to use sklearn when target variable is a proportion
Asked Answered
Y

2

9

There are standard ways of predicting proportions such as logistic regression (without thresholding) and beta regression. There have already been discussions about this:

http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable

http://scikit-learn-general.narkive.com/lLVQGzyl/beta-regression

I cannot tell if there exists a work-around within the sklearn framework.

Yovonnda answered 29/5, 2017 at 4:38 Comment(0)
C
21

There exists a workaround, but it is not intrinsically within the sklearn framework.

If you have a proportional target variable (value range 0-1) you run into two basic difficulties with scikit-learn:

  • Classifiers (such as logistic regression) deal with class labels as target variables only. As a workaround you could simply threshold your probabilities to 0/1 and interpret them as class labels, but you would lose a lot of information.
  • Regression models (such as linear regression) do not restrict the target variable. You can train them on proportional data, but there is no guarantee that the output on unseen data will be restricted to the 0/1 range. However, in this situation, there is a powerful work-around (below).

There are different ways to mathematically formulate logistic regression. One of them is the generalized linear model, which basically defines the logistic regression as a normal linear regression on logit-transformed probabilities. Normally, this approach requires sophisticated mathematical optimization because the probabilities are unknown and need to be estimated along with the regression coefficients.

In your case, however, the probabilities are known. This means you can simply transform them with y = log(p / (1 - p)). Now they cover the full range from -oo to oo and can serve as the target variable for a LinearRegression model [*]. Of course, the model output then needs to be transformed again to result in probabilities p = 1 / (exp(-y) + 1).

import numpy as np
from sklearn.linear_model import LinearRegression


class LogitRegression(LinearRegression):

    def fit(self, x, p):
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)


if __name__ == '__main__':
    # generate example data
    np.random.seed(42)
    n = 100
    x = np.random.randn(n).reshape(-1, 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5

    model = LogitRegression()
    model.fit(x, p)

    print(model.predict([[-10], [0.0], [1]]))
    # [[  2.06115362e-09]
    #  [  5.00000000e-01]
    #  [  8.80797078e-01]]
  • There are also numerous other alternatives. Some non-linear regression models can work naturally in the 0-1 range. For example Random Forest Regressors will never exceed the target variables' range they were trained with. Simply put probabilities in and you will get probabilities out. Neural networks with appropriate output activation functions (tanh, I guess) will also work well with probabilities, but if you want to use those there are more specialized libraries than sklearn.

[*] You could in fact plug in any linear regression model which can make the method more powerful, but then it no longer is exactly equivalent to logistic regression.

Carline answered 29/5, 2017 at 6:43 Comment(18)
Could you please explain what should be done about training / test data which contains a probability of 0 or 1? y is -inf and div by zero in these cases.Mendicity
@JakeDrew The easiest solution would be to replace 0 with e and 1 with 1-e, where e is a very small number. (You can also sanitize the probabilities with p = p * e + 0.5 * e). I guess e = 1e-16 would work well.Carline
Thanks for the fast response! I was experimenting exactly as you suggested earlier. I found that for the range p = (0, 1) using .009 and .991 for values of 0 and 1 produce an 10-fold cv MAE = 0.059 or 5.9%. Using p = 9e-16 on the same data drives the MAE up to 0.2266 or 22.6%. The precision of e seems to have a huge impact on the mean absolute error. When y = np.log(p / (1 - p)) and p=0.991 then y=6.9. When p= 9e-16, y=36.7. Perhaps I am over-fitting to my own dataset?Mendicity
@JakeDrew Overfitting is always a realistic possibility ;) The most extreme values have a lot of leverage on the regression and on the error estimates. There is not much general advice I can give here... best plot the variously transformed data (x vs y) and the resulting regression lines to get a feeling for what is happening.Carline
Thanks for your help! I found GLM family=binomial(link='logit') in R which accepts probabilities. However, my best results were using Random Forest Regression and SVR in sklearn. RFR produces an MAE of 2.42% during 10-fold cv while SVR was at 1.87 MAE even with a max() prediction of 100.93.Mendicity
p/(1-p) in the answer should be log(p/(1-p))Thermomagnetic
Thank you for a great answer (tho I'm a bit late to the party). Any idea if there's an official term or name for using a different linear model for logit-transformed dependant variables?Fluctuation
@Fluctuation I don't know any official term, but unofficially I would start by calling the whole thing a Generalized Linear Model (GLM). Then I would proceed by noting that these different linear models are actually only different algorithms for fitting a linear model. Depending on the used algorithm or it's properties I would construct a final name like Regularized GLM, Sparse GLM, Ridge GLM, Lasso GLM, etc.Carline
Thank you for a quick reply, this sounds quite reasonable. Another question, if you know: the info about the output of certain regressors is a bit difficult to find from their sklearn documentation (i.e. for the Random Forests, it only mentions being bounded by the training examples in an example page, not directly in the documentation). Do you know if there are any other standard sklearn regressors that would likewise limit their output according to the training examples, except Random Forests?Fluctuation
@Fluctuation No, I just mentioned Random Forests because I know them very well ;)Carline
How would you "test" the LogitRegressor? I assume mean-squared-error is not usefull in the logit domaineStomy
If I understood this correctly, this is basically the same fit/predict formulation for Logistic Regression, except for the loss function where here you rely on MSE from linear regression while logistic regression minimizes the log loss on a 1 or 0 truth label.Augsburg
@Stomy cross entropyCarline
@Augsburg In logistic regression p is a latent variable which requires iterative minimization of the loss function. Here p is known, which allows us to minimize the MSE in a closed form solution.Carline
How would I determine the 95% confidence interval with this?Arnelle
@HaikalYeo the confidence interval of what?Carline
Of the prediction.Arnelle
@HaikalYeo Maybe it's possible to take the confidence interval of the linear regression and apply the transformation. Alternatively, bootstrapping should always work.Carline
P
0

Mighty late to the party, but in case this can help anyone else out there:

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from scipy.special import polygamma, logit, expit
import pandas as pd
import numpy as np


class BetaRegressor(BaseEstimator, RegressorMixin):
    '''
    A Sklearn-compliant regressor class implementing the beta regression for
    continuous responses over the (0, 1) support.

    Parameters
    ----------
    `fit_intercept: bool = True`
        Optional flag indicating whether an intercept parameter should be
        included. Defaults to `True`.
    `maxiter: int = 100`
        The maximum number of iterations the Newton method is allowed to
        perform. Defaults to `100`.
    `eps: float = 0.0001`
        The maximum error the numerical method must reach to achieve
        convergence. Defaults to `0.0001`.
    `use_smart_seed: bool = True`
        Optional flag controlling whether a transformed regression should be
        use to provide better seed values for the algorithm. Defaults to
        `True`.

    Raises
    ------
    `AssertionError`
        If the maximum number of iterations is non-positive.
    `AssertionError`
        If the error tolerance parameter is non-positive.
    '''

    def __init__(
        self, fit_intercept: bool = True, maxiter: int = 100,
        eps: float = 0.0000001, use_smart_seed: bool = True
    ) -> None:
        assert maxiter > 0, 'Maximum number of iterations must be positive'
        assert eps > 0.0, 'Error tolerance parameter must be positive'

        super().__init__()
        self.eps = eps
        self.maxiter = maxiter
        self.fit_intercept = fit_intercept
        self.intercept_ = None
        self.coef_ = None
        self.shape = None
        self.niter = None
        self.status = 0
        self.use_smart_seed = use_smart_seed

    def __str__(self) -> str:
        return str({
            'intercept': None if self.intercept_ is None else np.round(self.intercept_, 4),  # noqa: E501
            'coef': None if self.coef_ is None else self.coef_.round(4),
            'shape': None if self.shape is None else np.round(self.shape, 4),
            'niter': self.niter, 'status': self.status
        })

    def fit(self, X: pd.DataFrame, y: pd.Series, verbose: bool = False):
        '''
        Fits the predictive model by maximizing the likelihood function of the
        beta distribution for the expit mean function given a data sample.

        Parameters
        ----------
        `X: pd.DataFrame`
            A feature matrix.
        `y: pd.Series`
            A response array.
        `verbose: bool = False`
            An optional flag controlling the verbosity of outputs. Defaults to
            `False`.

        Raises
        ------
        `AssertionError`
            If one or more response values are out of bounds.
        `AssertionError`
            If the feature matrix (including the added dummy column to account
            for the intercept parameter) is perfectly colinear.
        '''
        assert y.min() > 0.0 and y.max() < 1.0, 'Response is out of bounds'

        X_ = np.c_[np.ones_like(y), X] if self.fit_intercept else X

        assert np.linalg.det(X_.T @ X_) != 0.0, 'Feature matrix displays perfect collinearity'  # noqa: E501

        size, d = X_.shape
        check = False
        niter = 0
        logy = np.log(y)
        logy_ = np.log(1.0 - y)

        p = np.zeros(shape=d + 1)
        g = np.zeros(shape=d + 1)
        h = np.zeros(shape=(d + 1, d + 1))

        if self.use_smart_seed:
            # Transformed linear regression seeds
            reg_ = TransformedTargetRegressor(
                regressor=LinearRegression(fit_intercept=self.fit_intercept),
                func=logit, inverse_func=expit
            ).fit(
                X=X, y=y
            )

            if self.fit_intercept:
                p[0] = reg_.regressor_.intercept_

            p[int(self.fit_intercept):-1] = reg_.regressor_.coef_
            mean = expit(X_ @ p[:-1])
            p[-1] = np.sum(mean * (1.0 - mean)) / np.sum((y - mean) ** 2) - 1.0
        else:
            # Method of moments seeds
            mean = y.mean()

            if self.fit_intercept:
                p[0] = logit(mean)

            p[-1] = mean * (1.0 - mean) / y.var() - 1.0

        while not (check or niter > self.maxiter):
            niter += 1

            coef, shape = p[:-1], p[-1]
            mean = expit(X_ @ coef)

            # dl/db
            g[:-1] = X_.T @ (
                (
                    -1.0 * polygamma(n=0, x=mean * shape) +
                    1.0 * polygamma(n=0, x=(1.0 - mean) * shape) +
                    logy - logy_
                ) * (
                    shape * mean * (1.0 - mean)
                )
            ) / size

            # dl/dp
            g[-1] = (
                size * polygamma(n=0, x=shape) +
                mean.T @ (-polygamma(n=0, x=mean * shape) + logy) +
                (1.0 - mean).T @ (-polygamma(n=0, x=(1.0 - mean) * shape) + logy_)  # noqa: E501
            ) / size

            # d2l/db2
            sigma = (
                (
                    polygamma(n=1, x=mean * shape) +
                    polygamma(n=1, x=(1.0 - mean) * shape)
                ) * (
                    -1.0 * (shape * mean * (1.0 - mean)) ** 2.0
                ) + (
                    -polygamma(n=0, x=mean * shape) +
                    polygamma(n=0, x=(1.0 - mean) * shape) +
                    logy - logy_
                ) * (
                    shape * (1.0 - 2.0 * mean) * mean * (1.0 - mean)
                )
            ) / size

            # h[:-1, :-1] = X_.T @ np.diag(sigma) @ X_
            h[:-1, :-1] = np.multiply(sigma[:, None], X_).T @ X_

            # d2l/dbdp
            h[-1, :-1] = X_.T @ (
                (
                    -shape * mean * polygamma(n=1, x=mean * shape) +
                    shape * (1.0 - mean) * polygamma(n=1, x=(1.0 - mean) * shape) +  # noqa: E501
                    -polygamma(n=0, x=mean * shape) +
                    polygamma(n=0, x=(1.0 - mean) * shape) +
                    logy - logy_
                ) * (
                    mean * (1.0 - mean)
                )
            ) / size

            h[:-1, -1] = h[-1, :-1]

            # d2l/dp2
            h[-1, -1] = (
                size * polygamma(n=1, x=shape) +
                -mean.T @ (polygamma(n=1, x=mean * shape) * mean) +
                -(1.0 - mean).T @ (polygamma(n=1, x=(1.0 - mean) * shape) * (1.0 - mean))  # noqa: E501
            ) / size

            if np.linalg.det(h) == 0.0:
                break

            delta = -np.linalg.inv(a=h) @ g
            loss = np.linalg.norm(x=delta)
            check = loss <= self.eps

            if verbose:
                print(niter, p.round(decimals=4), np.round(loss, decimals=4))

            p += delta

        self.intercept_ = p[0] if self.fit_intercept else 0.0
        self.coef_ = p[int(self.fit_intercept):-1]
        self.shape = p[-1]
        self.niter = niter
        self.status = int(check)

        return self

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        '''
        Evaluates the expected values of the probabilistic model given the
        feature matrix.

        Parameters
        ----------
        `X: pd.DataFrame`
            The feature matrix.

        Returns
        -------
        `np.ndarray`
            An array of predictions.

        Raises
        ------
        `AssertionError`
            If the number of features do not match that provided at fitting.
        `AssertionError`
            If the regressor hasn't been successfully calibration prior to predicting.
        '''
        assert X.shape[1] == len(self.coef_), 'Dimension of feature matrix do not match those of coefficient array'  # noqa: E501
        assert self.status == 1, 'Regressor must be successfully fitted before use'  # noqa: E501

        return expit(self.intercept_ + X @ self.coef_)


if __name__ == '__main__':
    size, d = 100_000, 5
    intercept = np.random.normal(scale=0.1)
    coef = np.random.normal(size=d, scale=0.1)
    shape = 10.0

    X = np.random.normal(size=(size, d))
    loc = expit(intercept + X @ coef)
    y = np.random.beta(a=loc * shape, b=(1.0 - loc) * shape)

    reg = BetaRegressor(use_smart_seed=False).fit(X=X, y=y)

    print({
        'intercept': np.round(intercept, 4), 'coef': coef.round(4),
        'shape': np.round(shape, 4)
    })

    print(reg)

Adding some clarifications:

  • The code is a Sklearn API compliant implementation of a regressor class for the beta regression.
  • For individual data samples, we assume $y_i \mid \boldsymbol{x}_i \sim \mathcal{B}!\left(\mu_i, \phi\right)$, where responses are independant conditional on $\boldsymbol{X}$, the feature matrix.
  • The $\left(\mu, \phi\right) = \left(\tfrac{\alpha}{\alpha + \beta}, \alpha + \beta\right)$ parametrization is used in the same spirit as other more classical forms of regression do (e.g., GLMs).
  • This essentially means that the mean parameter is expressed as a transformation of $\boldsymbol{x}^\mathsf{T}\boldsymbol{\beta}$, the linear combination of all predictive features ($\boldsymbol{x}$) and regression coefficients ($\boldsymbol{\beta}$).
  • The so-called inverse link function, or mean function, is here defined as $\mu!\left(\boldsymbol{x}\right) = {\left{1 + \exp!\left(-\boldsymbol{x}^\mathsf{T}\boldsymbol{\beta}\right)\right}}^{-1}$ -- the expit function -- basically as a way to enforce the support the beta distribution is defined over ($\Omega_y = \left(0, 1\right)$).
  • The numerical solver implemented is Newton's method.
  • The regression coefficients estimates are obtained by maximising the log-likelihood function by way of this numerical solver.
  • Expressions for both gradient and hessian matrix are included within the code, and details of their mathematical derivations are not included here.
  • The seed values provided to the numerical solver rely on either of 2 strategies: "smart seeds" or the method of moments.
  • The method of moments assumes all coefficients are 0 except for the intercept parameter which, if included (via the fit_intercept argument), if evaluated so the theoretical mean matches the sample mean of the response.
  • Likewise the dispersion parameter, $\phi$, is set so that theoretical and sample variances are a match.
  • The smart seed strategy, on the other hand, uses a transformed regressor to fit a linear regression on the logit of the response.
  • Finally, a short illustrative example is provided under the if __name__ == '__main__': line.
Papeete answered 25/7, 2024 at 5:46 Comment(2)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Depreciate
I assume it is much clearer now, although I so wish SO would display LaTeX equations through MathJaX. Oh well.Papeete

© 2022 - 2025 — McMap. All rights reserved.