
How to use sklearn when the target variable is a proportion


There are standard ways of predicting proportions, such as logistic regression (without thresholding) and beta regression. This has already been discussed on the scikit-learn mailing list:

http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable

http://scikit-learn-general.narkive.com/lLVQGzyl/beta-regression

I cannot tell whether a workaround exists within the sklearn framework.


Solution

  • There exists a workaround, but it is not intrinsically within the sklearn framework.

    If you have a proportional target variable (value range 0-1) you run into two basic difficulties with scikit-learn:

      • Classifiers (such as LogisticRegression) expect discrete class labels as the target, so fitting them on continuous proportions fails (see the short demo below).

      • Regressors (such as LinearRegression) accept a continuous target, but their predictions are not restricted to the range 0-1.
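
    To illustrate the first point, here is a minimal demo; the exact wording of the error message depends on the scikit-learn version:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    x = np.linspace(-3, 3, 20).reshape(-1, 1)
    p = (1 / (1 + np.exp(-x))).ravel()  # continuous proportions in (0, 1)

    try:
        LogisticRegression().fit(x, p)
    except ValueError as e:
        print(e)  # e.g. "Unknown label type: continuous"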

    There are different ways to formulate logistic regression mathematically. One of them is the generalized linear model, which defines logistic regression as a normal linear regression on logit-transformed probabilities. Normally, this approach requires sophisticated mathematical optimization because the probabilities are unknown and need to be estimated along with the regression coefficients.
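
    In equation form, the GLM view models the log-odds (logit) of the probability as a linear function of the features, and recovers the probability with the inverse (logistic) transform:

    \log\frac{p}{1-p} = \beta_0 + \beta^\top x
    \qquad\Longleftrightarrow\qquad
    p = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}}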

    In your case, however, the probabilities are known. This means you can simply transform them with y = log(p / (1 - p)). Now they cover the full range from -∞ to +∞ and can serve as the target variable for a LinearRegression model [*]. Of course, the model output then needs to be transformed back into probabilities with p = 1 / (exp(-y) + 1). Note that this requires p to lie strictly between 0 and 1, since values of exactly 0 or 1 map to ±∞ under the logit transform.
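
    These two transforms are also available in SciPy as scipy.special.logit and scipy.special.expit, which makes it easy to sanity-check the round trip:

    import numpy as np
    from scipy.special import logit, expit

    p = np.array([0.1, 0.5, 0.9])
    y = logit(p)                     # log(p / (1 - p))
    print(np.allclose(expit(y), p))  # True: expit inverts logit

    Putting the two transforms together as a LinearRegression subclass: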

    import numpy as np
    from sklearn.linear_model import LinearRegression
    
    
    class LogitRegression(LinearRegression):
    
        def fit(self, x, p):
            # map proportions in (0, 1) onto the real line via the logit transform
            p = np.asarray(p)
            y = np.log(p / (1 - p))
            return super().fit(x, y)

        def predict(self, x):
            # map linear predictions back into (0, 1) via the logistic function
            y = super().predict(x)
            return 1 / (np.exp(-y) + 1)
    
    
    if __name__ == '__main__':
        # generate example data
        np.random.seed(42)
        n = 100
        x = np.random.randn(n).reshape(-1, 1)
        noise = 0.1 * np.random.randn(n).reshape(-1, 1)
        p = np.tanh(x + noise) / 2 + 0.5
    
        model = LogitRegression()
        model.fit(x, p)
    
        print(model.predict([[-10], [0.0], [1]]))
        # [[  2.06115362e-09]
        #  [  5.00000000e-01]
        #  [  8.80797078e-01]]
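
    One practical caveat: the logit transform is undefined for p values of exactly 0 or 1. A common workaround is to clip the proportions into a slightly narrower interval before fitting. Here is a minimal sketch of that idea; the class name ClippedLogitRegression and the eps parameter are inventions of this example, not part of the answer above:

    import numpy as np
    from sklearn.linear_model import LinearRegression


    class ClippedLogitRegression(LinearRegression):

        def __init__(self, eps=1e-6):
            super().__init__()
            self.eps = eps

        def fit(self, x, p):
            # clip into [eps, 1 - eps] so the logit stays finite
            p = np.clip(np.asarray(p), self.eps, 1 - self.eps)
            y = np.log(p / (1 - p))
            return super().fit(x, y)

        def predict(self, x):
            y = super().predict(x)
            return 1 / (np.exp(-y) + 1)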
    

    [*] You could in fact plug in any regression model here (not just LinearRegression), which can make the method more powerful, but then it is no longer exactly equivalent to logistic regression.
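
    On scikit-learn 0.20 or newer, the same idea can also be expressed without subclassing, using sklearn.compose.TransformedTargetRegressor together with SciPy's logit/expit. This is a sketch equivalent to the class above, and it makes swapping in a different base regressor straightforward:

    import numpy as np
    from scipy.special import logit, expit
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.linear_model import LinearRegression

    # fit on logit-transformed targets, map predictions back with expit
    model = TransformedTargetRegressor(
        regressor=LinearRegression(),
        func=logit,
        inverse_func=expit,
    )

    np.random.seed(42)
    x = np.random.randn(100).reshape(-1, 1)
    p = np.tanh(x) / 2 + 0.5

    model.fit(x, p)
    print(model.predict([[-10], [0.0], [1]]))
    # approximately [[2.06e-09], [5.00e-01], [8.81e-01]]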