Some pre-programmed models in R automatically remove linearly dependent variables from their regression output (e.g. lm()). With the bife package, this does not seem to be possible. As stated in the package documentation on CRAN (page 5):
If bife does not converge this is usually a sign of linear dependence between one or more regressors and the fixed effects. In this case, you should carefully inspect your model specification.
Now, suppose the problem at hand involves running many regressions, so that one cannot adequately inspect each regression output -- one has to adopt some sort of rule of thumb regarding the regressors. What could be some alternatives for removing linearly dependent regressors more or less automatically and achieving an adequate model specification?
I set up some code as an example below:
# sample code
x = 10*rnorm(40)
z = 100*rnorm(40)
df1 = data.frame(a = rep(c(0, 1), times = 20), x = x, y = x, z = z,
                 ID = 1:40, date = 1, Region = rep(c(1, 2, 3, 4), 10))
df2 = data.frame(a = c(rep(c(1, 0), times = 15), rep(c(0, 1), times = 5)),
                 x = 1.4*x + 4, y = 1.4*x + 4, z = 1.2*z + 5,
                 ID = 1:40, date = 2, Region = rep(c(1, 2, 3, 4), 10))
df3 = rbind(df1, df2)

for (i in 1:4) {
  dat = df3[df3$Region == i, ]  # avoid overwriting the vector x
  model = bife::bife(a ~ x + y + z | ID, data = dat)
  results = data.frame(Region = i)
  results$Model = list(model)   # store the fitted model in a list-column
  if (i == 1) {
    df4 = results
    next
  }
  df4 = rbind(df4, results)     # collect one row per region
}
Error: Linear dependent terms detected!
Since you're only looking at linear dependencies, you could simply leverage methods that detect them, like for instance lm.
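A minimal sketch of that idea, assuming the df3 built in the question: lm() reports aliased (linearly dependent) coefficients as NA, so their names can be collected and excluded before calling bife.

# Sketch of the lm-based check, using df3 from the question:
# lm() returns NA coefficients for linearly dependent regressors
fit = lm(a ~ x + y + z + factor(ID), data = df3)
coll = names(coef(fit))[is.na(coef(fit))]
coll  # here "y", the duplicated regressor to drop before calling bife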
Here's an example of a solution with the package fixest:
library(bife)
library(fixest)

x = 10*rnorm(40)
z = 100*rnorm(40)
df1 = data.frame(a = rep(c(0, 1), times = 20), x = x, y = x, z = z,
                 ID = 1:40, date = 1, Region = rep(c(1, 2, 3, 4), 10))
df2 = data.frame(a = c(rep(c(1, 0), times = 15), rep(c(0, 1), times = 5)),
                 x = 1.4*x + 4, y = 1.4*x + 4, z = 1.2*z + 5,
                 ID = 1:40, date = 2, Region = rep(c(1, 2, 3, 4), 10))
df3 = rbind(df1, df2)

vars = c("x", "y", "z")
res_all = list()
for (i in 1:4) {
  dat = df3[df3$Region == i, ]
  # feols stores the names of collinear variables in $collin.var
  coll_vars = feols(a ~ x + y + z | ID, dat, notes = FALSE)$collin.var
  # rebuild the formula without the collinear variables
  new_fml = xpd(a ~ ..vars | ID, ..vars = setdiff(vars, coll_vars))
  res_all[[i]] = bife::bife(new_fml, data = dat)
}
# Display all results
for (i in 1:4) {
  cat("\n#\n# Region: ", i, "\n#\n\n")
  print(summary(res_all[[i]]))
}
The functions needed here are feols and xpd, both from fixest. Some explanations:
feols, like lm, removes variables on the fly when they are found to be collinear. It stores the names of the collinear variables in the slot $collin.var (if none is found, it's NULL).
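As a quick illustration on the df3 built above (where the y column duplicates x):

est = feols(a ~ x + y + z | ID, df3, notes = FALSE)
est$collin.var  # here "y"; NULL when nothing is collinear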
Contrary to lm, feols also allows fixed-effects, so you can add them when you look for linear dependencies: this way you can spot complex linear dependencies that would also involve the fixed-effects.
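For instance, a variable that is constant within each ID is collinear with the ID fixed-effects even if it is not collinear with any single regressor. A hypothetical sketch, with w made up for illustration:

tmp = df3
tmp$w = ave(tmp$x, tmp$ID)  # within-ID mean: constant for each ID
feols(a ~ x + w | ID, tmp, notes = FALSE)$collin.var  # flags "w"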
I've set notes = FALSE, otherwise feols would have prompted a note referring to collinearity.
feols is fast (actually faster than lm for large data sets), so it won't be a strain on your analysis.
The function xpd expands the formula: it replaces any variable name starting with two dots with the associated argument that the user provides.
When the arguments of xpd are vectors, the behavior is to coerce them with pluses, so if ..vars = c("x", "y") is provided, the formula a ~ ..vars | ID will become a ~ x + y | ID.
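A one-line check of that behavior:

xpd(a ~ ..vars | ID, ..vars = c("x", "y"))
#> a ~ x + y | ID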
Here it replaces ..vars in the formula with setdiff(vars, coll_vars), which is the vector of variables that were not found to be collinear.
So you get an algorithm with automatic variable removal before performing bife estimations.
Finally, just a side comment: in general it's better to store results in lists since it avoids copies.
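For contrast, the two storage patterns side by side (schematic, using the objects from the question and the answer):

# growing a data.frame re-copies everything accumulated so far:
df4 = rbind(df4, results)
# assigning into a list appends without copying previous elements:
res_all[[i]] = bife::bife(new_fml, data = dat)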
I forgot to mention: if you don't need the bias correction (bife::bias_corr), then you can directly use fixest::feglm, which automatically removes collinear variables:
res_bife = bife::bife(a ~ x + z | ID, data = df3)
res_feglm = fixest::feglm(a ~ x + y + z | ID, df3, family = binomial)
rbind(coef(res_bife), coef(res_feglm))
#> x z
#> [1,] -0.02221848 0.03045968
#> [2,] -0.02221871 0.03045990