Some pre-programmed models in R automatically remove linearly dependent variables from their regression output (e.g. lm()). With the bife package, this does not seem to be possible. As stated in the package documentation on CRAN (page 5):
If bife does not converge this is usually a sign of linear dependence between one or more regressors and the fixed effects. In this case, you should carefully inspect your model specification.
Now, suppose the problem at hand involves running many regressions, so that one cannot adequately inspect each regression output -- one has to adopt some sort of rule of thumb regarding the regressors. What could be some alternatives for removing linearly dependent regressors more or less automatically and achieving an adequate model specification?
I set up some code as an example below:
# sample code
x = 10*rnorm(40)
z = 100*rnorm(40)
df1 = data.frame(a = rep(c(0, 1), times = 20), x = x, y = x, z = z,
                 ID = 1:40, date = 1, Region = rep(c(1, 2, 3, 4), 10))
df2 = data.frame(a = c(rep(c(1, 0), times = 15), rep(c(0, 1), times = 5)),
                 x = 1.4*x + 4, y = 1.4*x + 4, z = 1.2*z + 5,
                 ID = 1:40, date = 2, Region = rep(c(1, 2, 3, 4), 10))
df3 = rbind(df1, df2)

for (i in 1:4) {
  dat = df3[df3$Region == i, ]  # avoid overwriting the vector x
  model = bife::bife(a ~ x + y + z | ID, data = dat)
  results = data.frame(Region = i)
  results$Model = list(model)   # store the fitted model in a list-column
  if (i == 1) {
    df4 = results
    next
  }
  df4 = rbind(df4, results)     # collect one row per region
}
Error: Linear dependent terms detected!
Since you're only looking at linear dependencies, you could simply leverage methods that detect them, like for instance lm.
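A minimal sketch of that idea, assuming the df3 built in the question: lm() reports aliased (linearly dependent) coefficients as NA, so their names can be collected and excluded before calling bife.

# Sketch of the lm-based check, using df3 from the question:
# lm() returns NA coefficients for linearly dependent regressors
fit = lm(a ~ x + y + z + factor(ID), data = df3)
coll = names(coef(fit))[is.na(coef(fit))]
coll  # here "y", the duplicated regressor to drop before calling bife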
Here's an example of a solution with the package fixest:
library(bife)
library(fixest)

x = 10*rnorm(40)
z = 100*rnorm(40)
df1 = data.frame(a = rep(c(0, 1), times = 20), x = x, y = x, z = z,
                 ID = 1:40, date = 1, Region = rep(c(1, 2, 3, 4), 10))
df2 = data.frame(a = c(rep(c(1, 0), times = 15), rep(c(0, 1), times = 5)),
                 x = 1.4*x + 4, y = 1.4*x + 4, z = 1.2*z + 5,
                 ID = 1:40, date = 2, Region = rep(c(1, 2, 3, 4), 10))
df3 = rbind(df1, df2)

vars = c("x", "y", "z")
res_all = list()
for (i in 1:4) {
  dat = df3[df3$Region == i, ]
  # feols stores the names of collinear variables in $collin.var
  coll_vars = feols(a ~ x + y + z | ID, dat, notes = FALSE)$collin.var
  # rebuild the formula without the collinear variables
  new_fml = xpd(a ~ ..vars | ID, ..vars = setdiff(vars, coll_vars))
  res_all[[i]] = bife::bife(new_fml, data = dat)
}
# Display all results
for (i in 1:4) {
  cat("\n#\n# Region: ", i, "\n#\n\n")
  print(summary(res_all[[i]]))
}
The functions needed here are feols and xpd, both from fixest. Some explanations:
feols, like lm, removes variables on the fly when they are found to be collinear. It stores the names of the collinear variables in the slot $collin.var (if none is found, it's NULL).
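As a quick illustration on the df3 built above (where the y column duplicates x):

est = feols(a ~ x + y + z | ID, df3, notes = FALSE)
est$collin.var  # here "y"; NULL when nothing is collinear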
Contrary to lm, feols also allows fixed-effects, so you can add them when you look for linear dependencies: this way you can spot complex linear dependencies that would also involve the fixed-effects.
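For instance, a variable that is constant within each ID is collinear with the ID fixed-effects even if it is not collinear with any single regressor. A hypothetical sketch, with w made up for illustration:

tmp = df3
tmp$w = ave(tmp$x, tmp$ID)  # within-ID mean: constant for each ID
feols(a ~ x + w | ID, tmp, notes = FALSE)$collin.var  # flags "w"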
I've set notes = FALSE, otherwise feols would have prompted a note referring to collinearity.
feols is fast (actually faster than lm for large data sets), so it won't be a strain on your analysis.
The function xpd expands the formula: it replaces any variable name starting with two dots with the associated argument that the user provides.
When the arguments of xpd are vectors, the behavior is to coerce them with pluses, so if ..vars = c("x", "y") is provided, the formula a ~ ..vars | ID will become a ~ x + y | ID.
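A one-line check of that behavior:

xpd(a ~ ..vars | ID, ..vars = c("x", "y"))
#> a ~ x + y | ID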
Here it replaces ..vars in the formula with setdiff(vars, coll_vars), which is the vector of variables that were not found to be collinear.
So you get an algorithm with automatic variable removal before performing bife estimations.
Finally, just a side comment: in general it's better to store results in lists since it avoids copies.
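For contrast, the two storage patterns side by side (schematic, using the objects from the question and the answer):

# growing a data.frame re-copies everything accumulated so far:
df4 = rbind(df4, results)
# assigning into a list appends without copying previous elements:
res_all[[i]] = bife::bife(new_fml, data = dat)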
I forgot to mention: if you don't need the bias correction (bife::bias_corr), then you can directly use fixest::feglm, which automatically removes collinear variables:
res_bife = bife::bife(a ~ x + z | ID, data = df3)
res_feglm = fixest::feglm(a ~ x + y + z | ID, df3, family = binomial)
rbind(coef(res_bife), coef(res_feglm))
#> x z
#> [1,] -0.02221848 0.03045968
#> [2,] -0.02221871 0.03045990