pythonpandasstatastatsmodelslinearmodels

Multicollinearity in Panel Data in Python


I am used to using Stata or R to do linear regression models but I am transitioning more workflow over to Python.

The useful thing about these two programs is that they intuitively know that you do not care about all of the entity- or time-fixed effects in a linear model, so when estimating panel models, they will drop multicollinear dummies from the model (reporting which ones they drop).

While I understand that estimating models in such a way is not ideal and one should be careful about regressions to run (etc), this is useful in practice, because it means that you can see results first, and worry about some of the nuances of the dummies later (especially since you don't care about dummies in a fully saturated Fixed-Effects model).

Let me provide an example. The following requires linearmodels and loads a dataset and attempts to run a panel regression. It is a modified version of the example from their documentation.

# Load the data (requires statsmodels and linearmodels)
import statsmodels.api as sm
from linearmodels.datasets import wage_panel
import pandas as pd
data = wage_panel.load()
year = pd.Categorical(data.year)
data = data.set_index(['nr', 'year'])
data['year'] = year
print(wage_panel.DESCR)
print(data.head())

# Run the panel regression
from linearmodels.panel import PanelOLS
exog_vars = ['exper','union','married']
exog = sm.add_constant(data[exog_vars])
mod = PanelOLS(data.lwage, exog, entity_effects=True, time_effects=True)
fe_te_res = mod.fit()
print(fe_te_res)

This gives the following error:

AbsorbingEffectError: The model cannot be estimated. The included effects have fully absorbed one or more of the variables. This occurs when one or more of the dependent variable is perfectly explained using the effects included in the model.

However, if you estimate in Stata by exporting the same data to Stata, running:

data.drop(columns='year').to_stata('data.dta')

And then running the equivalent in your stata file (after loading in the data):

xtset nr year
xtreg lwage exper union married i.year, fe

This will do the following:

> . xtreg lwage exper union married i.year, fe
note: 1987.year omitted because of collinearity

Fixed-effects (within) regression               Number of obs      =      4360
Group variable: nr                              Number of groups   =       545

R-sq:  within  = 0.1689                         Obs per group: min =         8
       between = 0.0000                                        avg =       8.0
       overall = 0.0486                                        max =         8

                                                F(9,3806)          =     85.95
corr(u_i, Xb)  = -0.1747                        Prob > F           =    0.0000

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |   .0638624   .0032594    19.59   0.000     .0574721    .0702527
       union |   .0833697   .0194393     4.29   0.000     .0452572    .1214821
     married |   .0583372   .0183688     3.18   0.002     .0223235    .0943509
             |
        year |
       1981  |   .0496865   .0200714     2.48   0.013     .0103348    .0890382
       1982  |   .0399445    .019123     2.09   0.037     .0024521    .0774369
       1983  |   .0193513    .018662     1.04   0.300    -.0172373    .0559398
       1984  |   .0229574   .0186503     1.23   0.218    -.0136081    .0595229
       1985  |   .0081499   .0191359     0.43   0.670    -.0293677    .0456674
       1986  |   .0036329   .0200851     0.18   0.856    -.0357456    .0430115
       1987  |          0  (omitted)
             |
       _cons |   1.169184   .0231221    50.57   0.000     1.123851    1.214517
-------------+----------------------------------------------------------------
     sigma_u |  .40761229
     sigma_e |  .35343397
         rho |  .57083029   (fraction of variance due to u_i)
------------------------------------------------------------------------------

Notice that stata arbitrarily dropped 1987 from the regression, but still ran. Is there a way to get similar functionality in linearmodels or statsmodels?


Solution

  • The only way I can think to do it is manually.

    Sample Data

    import pandas as pd
    import numpy as np
    import statsmodels.api as sm
    
    from linearmodels.datasets import wage_panel
    from linearmodels.panel import PanelOLS
    
    data = wage_panel.load()
    

    First, we'll follow in Stata's footsteps, generating dummies for each of the year fixed effects and we leave out the first value, lexicographically sorted, (accomplished with the drop_first=True argument). It's important to use np.unique to then get the labels, as this sorts too. No need for statsmodels to add a constant, just do it yourself:

    data = wage_panel.load()
    data = pd.concat([data, pd.get_dummies(data.year, drop_first=True)], axis=1)
    exog_vars = ['exper','union','married'] + np.unique(data.year)[1::].tolist()
    data = data.set_index(['nr', 'year'])
    
    exog = data[exog_vars].assign(constant=1)
    

    If we try to run this model, it fails because of perfect collinearity. Because we're doing a within-regression we can't just test collinearity on exog, we need to first de-mean within the panel, as the collinearity of the de-meaned independent variables is what matters. I'll make a copy here as to not mess with exog which we will use in the ultimate regression.

    exog2 = exog.copy()
    exog2 = exog2 - exog2.groupby(level=0).transform('mean') + exog2.mean()
    

    We can now see there is perfect collinearity; when we regress a within de-meaned exog variable against every other variable we get a perfect R-squared for some regressions:

    for col in exog2:
        print(col, sm.OLS(exog2[col], exog2.drop(columns=col)).fit().rsquared)
    
    #exper 1.0
    #union 0.004326532427527674
    #married 0.18901677578724896
    #1981 1.0
    #1982 1.0
    #1983 1.0
    #1984 1.0
    #1985 1.0
    #1986 1.0
    #1987 1.0
    

    Now how Stata or other software packages decide which variable to drop is a mystery to me. In this case, you'd probably favor dropping one of your year dummies over the other variables. Let's just pick 1987 so we can get the same result as Stata in the end.

    exog2 = exog2.drop(columns=1987)
    for col in exog2:
        print(col, sm.OLS(exog2[col], exog2.drop(columns=col)).fit().rsquared)
    
    #exper 0.48631
    #union 0.0043265
    #married 0.1890
    #1981 0.34978
    #1982 0.28369
    #1983 0.2478680
    #1984 0.2469180
    #1985 0.2846552
    #1986 0.35067075
    #constant 0.94641
    

    So we've dealt with the collinearity and can go back to the model. As we've included the yearly Fixed Effects manually, we can remove time_effects from the model.

    mod = PanelOLS(data.lwage, exog.drop(columns=1987), entity_effects=True, time_effects=False)
    print(mod.fit())
    
                              PanelOLS Estimation Summary                           
    ================================================================================
    Dep. Variable:                  lwage   R-squared:                        0.1689
    Estimator:                   PanelOLS   R-squared (Between):             -0.0882
    No. Observations:                4360   R-squared (Within):               0.1689
    Date:                Sat, Mar 09 2019   R-squared (Overall):              0.0308
    Time:                        00:59:14   Log-likelihood                   -1355.7
    Cov. Estimator:            Unadjusted                                           
                                            F-statistic:                      85.946
    Entities:                         545   P-value                           0.0000
    Avg Obs:                       8.0000   Distribution:                  F(9,3806)
    Min Obs:                       8.0000                                           
    Max Obs:                       8.0000   F-statistic (robust):             85.946
                                            P-value                           0.0000
    Time periods:                       8   Distribution:                  F(9,3806)
    Avg Obs:                       545.00                                           
    Min Obs:                       545.00                                           
    Max Obs:                       545.00                                           
    
                                 Parameter Estimates                              
    ==============================================================================
                Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
    ------------------------------------------------------------------------------
    exper          0.0639     0.0033     19.593     0.0000      0.0575      0.0703
    union          0.0834     0.0194     4.2887     0.0000      0.0453      0.1215
    married        0.0583     0.0184     3.1759     0.0015      0.0223      0.0944
    1981           0.0497     0.0201     2.4755     0.0133      0.0103      0.0890
    1982           0.0399     0.0191     2.0888     0.0368      0.0025      0.0774
    1983           0.0194     0.0187     1.0369     0.2998     -0.0172      0.0559
    1984           0.0230     0.0187     1.2309     0.2184     -0.0136      0.0595
    1985           0.0081     0.0191     0.4259     0.6702     -0.0294      0.0457
    1986           0.0036     0.0201     0.1809     0.8565     -0.0357      0.0430
    constant       1.1692     0.0231     50.566     0.0000      1.1239      1.2145
    ==============================================================================
    

    Which matches the Stata output. There wasn't really any reason to drop 1987, we could have chosen something else, but at least with this we can see the results match xtreg.