python regression statsmodels standard-error linearmodels

Regression standard errors clustered AND robust to heteroskedasticity + serial autocorrelation


As indicated in the title, I'm trying to run a regression in Python where the standard errors are clustered as well as robust to heteroskedasticity and autocorrelation (HAC). I'm working within statsmodels (sm), but I'm open to using other libraries (e.g. linearmodels).

To cluster e.g. by id, the code would be

sm.OLS.from_formula(formula='y ~ x', data=df).fit(cov_type='cluster', cov_kwds={'groups': df['id']}, use_t=True) 

For HAC standard errors, the code would be

sm.OLS.from_formula(formula='y ~ x', data=df).fit(cov_type='HAC', cov_kwds={'maxlags': max_lags}, use_t=True)

Since cov_type can't be 'cluster' and 'HAC' at the same time, it seems that statsmodels can't do both. Is that right, or is there another way to get both?


Solution

  • There are two panel HAC cov_types, hac-groupsum and hac-panel. I only know their use for panel data, but they should also work with clustered data. As far as I remember, some of the literature found that they do not perform well on highly unbalanced data (e.g. comparing population data of US states, which differ widely in size).

    https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLSResults.get_robustcov_results.html

    The main reference for implementing that was the article by Petersen, e.g.

    https://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/standarderror.html

    Examples for some comparison to Petersen are in the unit tests.

    Statsmodels also has cluster-robust standard errors for two-way clustering.

    The stochastic behavior of these covariance matrices depends on whether the number of clusters, the number of time periods or both become large in large samples.