python regression robust

get p value and r value from HuberRegressor in Sklearn


I have datasets with some outliers. With simple linear regression, using

stat_lin = stats.linregress(X, Y)

I can get the coefficient, intercept, r_value, p_value and std_err.

But I want to apply a robust regression method, as I don't want the outliers to influence the fit.

So I applied the Huber regressor from sklearn:

huber = linear_model.HuberRegressor(alpha=0.0, epsilon=1.35)
# X must be 2-D (n_samples, n_features); y is expected to be 1-D
huber.fit(mn_all_df['X'].to_numpy().reshape(-1, 1), mn_all_df['Y'].to_numpy())

From that, I can get the coefficient, intercept, scale and outliers.

I am happy with the result: the coefficient is higher and the regression line fits the majority of the data points.

However, I need values such as the r value and p value to be able to say that the result from the Huber regressor is significant.

How can I get the r value and p value from a robust regression (in my case, the Huber regressor)?


Solution

  • You can also use robust linear models in statsmodels. For example:

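    import statsmodels.api as sm
    from sklearn import datasets

    # Load the iris data set
    iris = datasets.load_iris()
    x = iris.data[:, 0]  # sepal length
    y = iris.data[:, 2]  # petal length

    # Fit a robust linear model with Huber's T norm; add_constant adds the intercept column
    rlm_model = sm.RLM(y, sm.add_constant(x), M=sm.robust.norms.HuberT())
    rlm_results = rlm_model.fit()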
    

    The p value you get from scipy.stats.linregress is the p-value for the null hypothesis that the slope is zero. For the robust model, you can read the equivalent value off the x1 row (column P>|z|) of:

    rlm_results.summary()
                         
    ==============================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const         -7.1311      0.539    -13.241      0.000      -8.187      -6.076
    x1             1.8648      0.091     20.434      0.000       1.686       2.044
    ==============================================================================
    

    The r_value from linregress is the Pearson correlation coefficient between x and y, and it stays the same however the line is fitted. A robust linear model weighs the observations differently to make the fit less sensitive to outliers, so the usual r-squared calculation does not make sense here. Computed the usual way, it may come out lower than for ordinary least squares, because the robust line is not pulled towards the outlying data points.

    See the comments by @Josef (who maintains statsmodels) on this question and this answer. If you would like a meaningful r-squared, you can try the calculation described there:

    How to get R-squared for robust regression (RLM) in Statsmodels?
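
One way to get an r-squared that respects the robust weights is a weighted version of the usual formula. The sketch below is an assumption about what such a calculation could look like, not necessarily the one in the linked answer; `weighted_r_squared` is a hypothetical helper name, and the toy arrays are made up for illustration.

```python
import numpy as np

def weighted_r_squared(y, fitted, weights):
    """Weighted R^2: 1 - (weighted residual SS) / (weighted total SS).

    The total sum of squares is taken around the weighted mean of y.
    """
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    weights = np.asarray(weights, dtype=float)
    y_bar = np.average(y, weights=weights)
    ss_res = np.sum(weights * (y - fitted) ** 2)
    ss_tot = np.sum(weights * (y - y_bar) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy data (illustrative only): the last point is an outlier given zero weight.
y = np.array([1.0, 2.0, 3.0, 4.0, 50.0])
fitted = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
weights = np.array([1.0, 1.0, 1.0, 1.0, 0.0])

print(weighted_r_squared(y, fitted, weights))          # outlier ignored
print(weighted_r_squared(y, fitted, np.ones_like(y)))  # unweighted: negative here
```

With the statsmodels fit above, the inputs would be `y`, `rlm_results.fittedvalues` and `rlm_results.weights`. Treat the result as a descriptive measure of how well the line fits the points the model trusts, not as a direct replacement for the OLS r-squared.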