I am used to using Stata or R to do linear regression models but I am transitioning more workflow over to Python.
The useful thing about those two programs is that they know you do not care about every individual entity- or time-fixed-effect dummy in a panel model, so when estimating such models they drop multicollinear dummies automatically (and report which ones they dropped).
While I understand that estimating models this way is not ideal, and that one should be careful about which regressions to run, it is useful in practice: you can look at results first and worry about the nuances of the dummies later (especially since you don't care about the dummy coefficients in a fully saturated fixed-effects model).
Let me provide an example. The following code (a modified version of the example in the linearmodels documentation) loads a dataset and attempts to run a panel regression; it requires linearmodels.
# Load the data (requires statsmodels and linearmodels)
import statsmodels.api as sm
from linearmodels.datasets import wage_panel
import pandas as pd
data = wage_panel.load()
year = pd.Categorical(data.year)
data = data.set_index(['nr', 'year'])
data['year'] = year
print(wage_panel.DESCR)
print(data.head())
# Run the panel regression
from linearmodels.panel import PanelOLS
exog_vars = ['exper','union','married']
exog = sm.add_constant(data[exog_vars])
mod = PanelOLS(data.lwage, exog, entity_effects=True, time_effects=True)
fe_te_res = mod.fit()
print(fe_te_res)
This gives the following error:
AbsorbingEffectError: The model cannot be estimated. The included effects have fully absorbed one or more of the variables. This occurs when one or more of the dependent variable is perfectly explained using the effects included in the model.
However, you can estimate the same model in Stata. Export the data:
data.drop(columns='year').to_stata('data.dta')
and then run the equivalent regression in a Stata do-file (after loading the data):
xtset nr year
xtreg lwage exper union married i.year, fe
This produces the following output:
> . xtreg lwage exper union married i.year, fe
note: 1987.year omitted because of collinearity
Fixed-effects (within) regression Number of obs = 4360
Group variable: nr Number of groups = 545
R-sq: within = 0.1689 Obs per group: min = 8
between = 0.0000 avg = 8.0
overall = 0.0486 max = 8
F(9,3806) = 85.95
corr(u_i, Xb) = -0.1747 Prob > F = 0.0000
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
exper | .0638624 .0032594 19.59 0.000 .0574721 .0702527
union | .0833697 .0194393 4.29 0.000 .0452572 .1214821
married | .0583372 .0183688 3.18 0.002 .0223235 .0943509
|
year |
1981 | .0496865 .0200714 2.48 0.013 .0103348 .0890382
1982 | .0399445 .019123 2.09 0.037 .0024521 .0774369
1983 | .0193513 .018662 1.04 0.300 -.0172373 .0559398
1984 | .0229574 .0186503 1.23 0.218 -.0136081 .0595229
1985 | .0081499 .0191359 0.43 0.670 -.0293677 .0456674
1986 | .0036329 .0200851 0.18 0.856 -.0357456 .0430115
1987 | 0 (omitted)
|
_cons | 1.169184 .0231221 50.57 0.000 1.123851 1.214517
-------------+----------------------------------------------------------------
sigma_u | .40761229
sigma_e | .35343397
rho | .57083029 (fraction of variance due to u_i)
------------------------------------------------------------------------------
Notice that Stata arbitrarily dropped the 1987 dummy because of collinearity but still ran the regression. Is there a way to get similar functionality in linearmodels or statsmodels?
The only way I can think to do it is manually.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from linearmodels.datasets import wage_panel
from linearmodels.panel import PanelOLS
data = wage_panel.load()
First, we'll follow in Stata's footsteps and generate dummies for each of the year fixed effects, leaving out the first value in lexicographic order (accomplished with the drop_first=True argument). It's important to use np.unique to get the labels afterwards, since it sorts them the same way. There's no need for statsmodels to add a constant; just do it yourself:
data = wage_panel.load()
data = pd.concat([data, pd.get_dummies(data.year, drop_first=True)], axis=1)
exog_vars = ['exper','union','married'] + np.unique(data.year)[1::].tolist()
data = data.set_index(['nr', 'year'])
exog = data[exog_vars].assign(constant=1)
If we try to run this model as-is, it fails because of perfect collinearity. Because we're running a within regression, we can't just test collinearity on exog; we first need to de-mean within the panel, since it's the collinearity of the de-meaned regressors that matters. I'll make a copy so as not to mess with exog, which we'll use in the final regression.
exog2 = exog.copy()
exog2 = exog2 - exog2.groupby(level=0).transform('mean') + exog2.mean()
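In other words, for each column this applies the usual within (entity-demeaning) transformation, adding the grand mean back so that the constant column of ones survives: $\ddot{x}_{it} = x_{it} - \bar{x}_{i} + \bar{x}$.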
We can now see the perfect collinearity: regressing each de-meaned exog variable on all the others gives a perfect R-squared for several of them:
for col in exog2:
print(col, sm.OLS(exog2[col], exog2.drop(columns=col)).fit().rsquared)
#exper 1.0
#union 0.004326532427527674
#married 0.18901677578724896
#1981 1.0
#1982 1.0
#1983 1.0
#1984 1.0
#1985 1.0
#1986 1.0
#1987 1.0
Now how Stata or other software packages decide which variable to drop is a mystery to me. In this case, you'd probably favor dropping one of your year dummies over the other variables. Let's just pick 1987 so we can get the same result as Stata in the end.
exog2 = exog2.drop(columns=1987)
for col in exog2:
print(col, sm.OLS(exog2[col], exog2.drop(columns=col)).fit().rsquared)
#exper 0.48631
#union 0.0043265
#married 0.1890
#1981 0.34978
#1982 0.28369
#1983 0.2478680
#1984 0.2469180
#1985 0.2846552
#1986 0.35067075
#constant 0.94641
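As an aside, if you'd rather not eyeball the R-squareds, one way to automate the choice is a greedy rank check on the de-meaned design matrix, keeping each column only if it adds rank. This is just a sketch (greedy_keep is a hand-rolled helper, not something linearmodels or statsmodels provides), and which column it flags depends on the order the columns are scanned; here that happens to be 1987, the same column Stata dropped:
import numpy as np
# Re-build the de-meaned design including 1987 so the check sees the full column set
check = exog - exog.groupby(level=0).transform('mean') + exog.mean()
def greedy_keep(df):
    # Hand-rolled helper: keep a column only if it raises the rank of the matrix
    # built so far; anything that doesn't is perfectly collinear with what came before.
    keep, drop = [], []
    for col in df.columns:
        trial = df[keep + [col]].to_numpy(dtype=float)
        if np.linalg.matrix_rank(trial) == len(keep) + 1:
            keep.append(col)
        else:
            drop.append(col)
    return keep, drop
print(greedy_keep(check)[1])  # [1987] with this column order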
So we've dealt with the collinearity and can go back to the model. Since we've included the year fixed effects manually, we can drop time_effects from the model.
mod = PanelOLS(data.lwage, exog.drop(columns=1987), entity_effects=True, time_effects=False)
print(mod.fit())
PanelOLS Estimation Summary
================================================================================
Dep. Variable: lwage R-squared: 0.1689
Estimator: PanelOLS R-squared (Between): -0.0882
No. Observations: 4360 R-squared (Within): 0.1689
Date: Sat, Mar 09 2019 R-squared (Overall): 0.0308
Time: 00:59:14 Log-likelihood -1355.7
Cov. Estimator: Unadjusted
F-statistic: 85.946
Entities: 545 P-value 0.0000
Avg Obs: 8.0000 Distribution: F(9,3806)
Min Obs: 8.0000
Max Obs: 8.0000 F-statistic (robust): 85.946
P-value 0.0000
Time periods: 8 Distribution: F(9,3806)
Avg Obs: 545.00
Min Obs: 545.00
Max Obs: 545.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
exper 0.0639 0.0033 19.593 0.0000 0.0575 0.0703
union 0.0834 0.0194 4.2887 0.0000 0.0453 0.1215
married 0.0583 0.0184 3.1759 0.0015 0.0223 0.0944
1981 0.0497 0.0201 2.4755 0.0133 0.0103 0.0890
1982 0.0399 0.0191 2.0888 0.0368 0.0025 0.0774
1983 0.0194 0.0187 1.0369 0.2998 -0.0172 0.0559
1984 0.0230 0.0187 1.2309 0.2184 -0.0136 0.0595
1985 0.0081 0.0191 0.4259 0.6702 -0.0294 0.0457
1986 0.0036 0.0201 0.1809 0.8565 -0.0357 0.0430
constant 1.1692 0.0231 50.566 0.0000 1.1239 1.2145
==============================================================================
This matches the Stata output. There wasn't any particular reason to drop 1987; we could have chosen a different year dummy, but at least this way we can see that the results match xtreg.
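For example (reusing data and exog from above), dropping 1981 instead of 1987 changes the reported dummy coefficients and the constant, but it should leave the fit itself unchanged, since the two designs span the same column space:
# Same model, but with the 1981 dummy dropped instead of 1987.
# The parameterization changes; the within R-squared should not.
mod_alt = PanelOLS(data.lwage, exog.drop(columns=1981), entity_effects=True)
alt_res = mod_alt.fit()
print(alt_res.rsquared)  # should match the 0.1689 reported above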