Difference between rlm() and lm_robust()

Due to outliers, I would like to use a robust regression method instead of lm().

I can't decide whether to use lm_robust (estimatr package) or rlm (MASS package). Are there mathematical differences between the functions? Which one would you recommend?

Solution

'Robust' is one of those terms in statistics, like 'exact' or 'weighted', that gets applied to what may turn out to be very different approaches. As the Wikipedia article on robust regression explains, there are broadly at least two issues that can cause ordinary linear regression to go awry:

- outliers, or highly influential observations that, because of their location, have a much larger impact on the parameter estimates than other observations, depending on whether they are included in or excluded from the fit.
- lack of homoscedasticity, or homogeneity of variance (all errors having the same variance, which is part of the assumption of independent and identically distributed errors that many statistical models rely on).

Each of these has its own remedies, and the packages you mention implement some of them. MASS::rlm and robustbase::lmrob fit M-type estimators (rlm defaults to a Huber M-estimator, lmrob to an MM-estimator), which downweight observations with large residuals and can have higher breakdown points (tolerance for outliers) than ordinary least squares. estimatr::lm_robust, by contrast, keeps the ordinary least squares coefficients and replaces the usual standard errors with a heteroscedasticity-consistent variance estimate, more broadly a sandwich (co)variance estimator, which relaxes the homoscedasticity assumption.
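To make the difference concrete, here is a minimal sketch on simulated data. The data frame d, the planted outlier, and the object names are made up for illustration, and it assumes the MASS and estimatr packages are installed.

```r
# Illustrative sketch only: simulated data with one planted outlier.
library(MASS)      # rlm()
library(estimatr)  # lm_robust()

set.seed(1)
d <- data.frame(x = 1:30)
d$y <- 2 + 0.5 * d$x + rnorm(30)
d$y[15] <- 60  # one gross outlier in the response (hypothetical)

fit_ols <- lm(y ~ x, data = d)
fit_rlm <- rlm(y ~ x, data = d)        # Huber M-estimator by default
fit_hc  <- lm_robust(y ~ x, data = d)  # OLS coefficients, HC2 standard errors by default

# rlm() changes the coefficients; lm_robust() reproduces those of lm()
print(rbind(lm = coef(fit_ols), rlm = coef(fit_rlm), lm_robust = coef(fit_hc)))

# lm_robust() changes the standard errors instead
print(cbind(lm = sqrt(diag(vcov(fit_ols))), lm_robust = fit_hc$std.error))

# rlm() keeps the final weights from its iteratively reweighted fit;
# the planted outlier should end up with a weight well below 1
print(fit_rlm$w[15])
```

If the lm and lm_robust rows of the first table agree while the rlm row differs, that is the point: only rlm changes how much each observation contributes to the fit.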

From your comments it seems your data are primarily affected by outliers, so MASS::rlm or robustbase::lmrob would be the better options to try first. What's more, with the sandwich package, for example, you can obtain a sandwich covariance estimate for many different kinds of models fit in R, so a combination might get you the best of both.
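A rough sketch of such a combination, assuming the sandwich package's methods for rlm objects behave as documented. The simulated data and the object names are hypothetical.

```r
# Illustrative sketch only: outliers plus error variance that grows with x.
library(MASS)      # rlm()
library(sandwich)  # sandwich()

set.seed(2)
n <- 100
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.2 + 0.3 * x)  # heteroscedastic errors
y[c(10, 50)] <- y[c(10, 50)] + 25                # two gross outliers (hypothetical)

fit <- rlm(y ~ x)  # robust coefficients

# Standard errors reported by rlm() next to sandwich-based ones
se_model    <- summary(fit)$coefficients[, "Std. Error"]
se_sandwich <- sqrt(diag(sandwich(fit)))
print(cbind(se_model, se_sandwich))
```

Whether the sandwich standard errors are preferable depends on your data; treat this as a starting point for comparison, not a recommendation.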