I have datasets with some outliers. From the simple linear regression, using
stat_lin = stats.linregress(X, Y)
I can get the slope, intercept, r_value, p_value and std_err.
But I want to apply a robust regression method, as I don't want the outliers to influence the fit.
So I applied the Huber regressor from scikit-learn:
huber = linear_model.HuberRegressor(alpha=0.0, epsilon=1.35)
huber.fit(mn_all_df['X'].to_numpy().reshape(-1, 1), mn_all_df['Y'].to_numpy())  # y must be 1-D
From that, I can get the coefficient, intercept, scale and outliers.
I am happy with the result, as the coefficient is higher and the regression line fits the majority of the data points.
However, I need values such as an r value and a p value to show that the results from the Huber regressor are significant.
How can I get an r value and a p value from a robust regression (in my case, the Huber regressor)?
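You can also use robust linear models in statsmodels. For example:
import statsmodels.api as sm
from sklearn import datasets

iris = datasets.load_iris()
x = iris.data[:, 0]  # sepal length
y = iris.data[:, 2]  # petal length

# Robust linear model with Huber's T norm; add_constant adds the intercept column
rlm_model = sm.RLM(y, sm.add_constant(x), M=sm.robust.norms.HuberT())
rlm_results = rlm_model.fit()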
The p-value you get from scipy.stats.linregress is the p-value for the null hypothesis that the slope is zero. You can get the equivalent for the robust fit by doing:
rlm_results.summary()
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -7.1311      0.539    -13.241      0.000      -8.187      -6.076
x1             1.8648      0.091     20.434      0.000       1.686       2.044
==============================================================================
Now, the r_value from linregress is the correlation coefficient between x and y, and it stays the same whichever line you fit. With a robust linear model, you are weighting your observations differently, which makes the fit less sensitive to outliers, so the usual r-squared calculation does not make sense here. You might get a lower r-squared, since the line is no longer pulled towards the outlying data points.
See the comments by @Josef (who maintains statsmodels) on this question and its answer. You can try the calculation given there if you would like a meaningful r-squared:
How to get R-squared for robust regression (RLM) in Statsmodels?