I am trying to reproduce fixed effects coefficients for panel data using different packages and techniques: (1) plm()
, (2) lfe()
, (3) dummy-lsdv with lm()
, and (4) demeaned-fe with lm()
.
My data set consists of 1581 observations and 13 variables. It is survey data from 527 respondents (var = respondent) in 3 waves (var = wave). I have one DV (y) and 10 IVs (x1 to x10).
The data set looks like this:
respondent wave y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 2 NA NA 2 1 1.5 NA NA 2
2 1 2 NA 2 NA 0 0 0 1 4 4 1 3
3 1 3 NA 4 5 NA NA NA NA 8 NA NA 1
4 2 1 0.931 3 3 2 2 2 4 7.5 7.5 NA 3
5 2 2 0.986 4 NA NA 2 2 4.5 6.5 5 3 4
6 2 3 0.986 4 3 2 2 2 3 3 3 2 3
My question: When I perform fixed effects regressions with (1) plm()
, (2) lfe()
, and (3) dummy-lsdv with lm()
, the models always return the same coefficients. However, when I perform a fixed effects regression using (4) demeaned data and the lm()
package, I get different coefficients. This puzzles me and I wonder: why?
Here is my code:
1. plm()
:
Input:
library(plm)
model_plm <- plm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10,
data = dataset,
index=c("respondent","wave"),
model = "within",
effect = 'individual')
summary(model_plm)
Output:
Unbalanced Panel: n = 228, T = 1-2, N = 316
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.3240866 -0.0048416 0.0000000 0.0048416 0.3240866
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
x1 -0.0216484 0.0167614 -1.2916 0.20032
x2 0.0178114 0.0141219 1.2613 0.21097
x3 -0.0145262 0.0103954 -1.3974 0.16627
x4 -0.0061660 0.0133069 -0.4634 0.64439
x5 0.0174401 0.0144256 1.2090 0.23032
x6 -0.0053556 0.0067210 -0.7968 0.42796
x7 0.0065517 0.0097627 0.6711 0.50415
x8 -0.0151375 0.0081992 -1.8462 0.06865 .
x9 0.0235351 0.0092612 2.5412 0.01303 *
x10 0.0235181 0.0228927 1.0273 0.30745
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
2. lfe()
:
Input:
library(lfe)
model_lfe <- felm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 | respondent, data = dataset)
summary(model_lfe)
Output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x1 -0.021648 0.016761 -1.292 0.2003
x2 0.017811 0.014122 1.261 0.2110
x3 -0.014526 0.010395 -1.397 0.1663
x4 -0.006166 0.013307 -0.463 0.6444
x5 0.017440 0.014426 1.209 0.2303
x6 -0.005356 0.006721 -0.797 0.4280
x7 0.006552 0.009763 0.671 0.5041
x8 -0.015138 0.008199 -1.846 0.0687 .
x9 0.023535 0.009261 2.541 0.0130 *
x10 0.023518 0.022893 1.027 0.3074
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
3. LSDV with lm()
:
Input:
model_lsdv <- lm(y ~ as_factor(respondent) + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, data = dataset)
options(max.print=2000)
summary(model_lsdv)
Output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9499746 0.1505806 6.309 1.57e-08 ***
[...]
x1 -0.0216484 0.0167614 -1.292 0.20032
x2 0.0178114 0.0141219 1.261 0.21097
x3 -0.0145262 0.0103954 -1.397 0.16627
x4 -0.0061660 0.0133069 -0.463 0.64439
x5 0.0174401 0.0144256 1.209 0.23032
x6 -0.0053556 0.0067210 -0.797 0.42796
x7 0.0065517 0.0097627 0.671 0.50415
x8 -0.0151375 0.0081992 -1.846 0.06865 .
x9 0.0235351 0.0092612 2.541 0.01303 *
x10 0.0235181 0.0228927 1.027 0.30745
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
4. Demeaned FE with lm()
:
Input:
dataset_demeaned <- with(dataset, data.frame(respondent = respondent,
wave = wave,
y = y - ave(y, respondent, FUN=function(x) mean(x, na.rm=T)),
x1 = x1 - ave(x1, respondent, FUN=function(x) mean(x, na.rm=T)),
x2 = x2 - ave(x2, respondent, FUN=function(x) mean(x, na.rm=T)),
x3 = x3 - ave(x3, respondent, FUN=function(x) mean(x, na.rm=T)),
x4 = x4 - ave(x4, respondent, FUN=function(x) mean(x, na.rm=T)),
x5 = x5 - ave(x5, respondent, FUN=function(x) mean(x, na.rm=T)),
x6 = x6 - ave(x6, respondent, FUN=function(x) mean(x, na.rm=T)),
x7 = x7 - ave(x7, respondent, FUN=function(x) mean(x, na.rm=T)),
x8 = x8 - ave(x8, respondent, FUN=function(x) mean(x, na.rm=T)),
x9 = x9 - ave(x9, respondent, FUN=function(x) mean(x, na.rm=T)),
x10 = x10 - ave(x10, respondent, FUN=function(x) mean(x, na.rm=T))
)
)
model_dmd <- lm(y ~ 0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, data = dataset_demeaned)
summary(model_dmd)
Output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x1 -0.006223 0.008220 -0.757 0.44957
x2 0.013181 0.007880 1.673 0.09543 .
x3 -0.012807 0.005484 -2.335 0.02018 *
x4 -0.006431 0.006311 -1.019 0.30900
x5 0.015455 0.005941 2.602 0.00973 **
x6 -0.001429 0.003402 -0.420 0.67483
x7 0.004362 0.004698 0.929 0.35387
x8 -0.009336 0.004366 -2.139 0.03326 *
x9 0.015731 0.005267 2.987 0.00305 **
x10 0.007631 0.010922 0.699 0.48529
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
More information:
I have already performed these checks:
demean()
function. --> Same results as in 4.ave()
and the demean()
functions.na.action
option because I hoped that the problem might result from a different treatment of missing values. But it did not change the results.as_factor
in the (4) demeaned fe model. Like: model_dmd <- lm(y ~ 0 + as_factor(respondent) + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, data = dataset_demeaned)
. This approach reproduced the correct coefficients. However, demeaning should already take care of unobserved heterogeneity and thus the inclusion of dummies seems redundant.So my best guess is that the problem does not come from the process of demeaning, but from the lm()
function. Maybe the fact that the panel is unbalanced
plays a role here?
I am very thankful for any suggestions and explanations!
SOLUTION:
Thanks to @G.Grothendieck, I can post the solution here. The correct code of (4) Demeaned FE with lm()
should read as:
Input:
# Delete all rows with NAs
dataset <- na.omit(dataset)
# Demean the rows that are left behind
dataset_demeaned <- with(dataset, data.frame(respondent = respondent,
wave = wave,
y = y - ave(y, respondent, FUN=function(x) mean(x, na.rm=T)),
x1 = x1 - ave(x1, respondent, FUN=function(x) mean(x, na.rm=T)),
x2 = x2 - ave(x2, respondent, FUN=function(x) mean(x, na.rm=T)),
x3 = x3 - ave(x3, respondent, FUN=function(x) mean(x, na.rm=T)),
x4 = x4 - ave(x4, respondent, FUN=function(x) mean(x, na.rm=T)),
x5 = x5 - ave(x5, respondent, FUN=function(x) mean(x, na.rm=T)),
x6 = x6 - ave(x6, respondent, FUN=function(x) mean(x, na.rm=T)),
x7 = x7 - ave(x7, respondent, FUN=function(x) mean(x, na.rm=T)),
x8 = x8 - ave(x8, respondent, FUN=function(x) mean(x, na.rm=T)),
x9 = x9 - ave(x9, respondent, FUN=function(x) mean(x, na.rm=T)),
x10 = x10 - ave(x10, respondent, FUN=function(x) mean(x, na.rm=T))
)
)
model_dmd <- lm(y ~ 0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, data = dataset_demeaned)
summary(model_dmd)
Output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x1 -0.021648 0.008462 -2.558 0.011004 *
x2 0.017811 0.007130 2.498 0.013009 *
x3 -0.014526 0.005248 -2.768 0.005989 **
x4 -0.006166 0.006718 -0.918 0.359452
x5 0.017440 0.007283 2.395 0.017240 *
x6 -0.005356 0.003393 -1.578 0.115530
x7 0.006552 0.004929 1.329 0.184768
x8 -0.015138 0.004140 -3.657 0.000301 ***
x9 0.023535 0.004676 5.033 8.24e-07 ***
x10 0.023518 0.011558 2.035 0.042734 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
By demeaning each column separately this is handling the NAs inconsistently. Each row has to be used or not used. One can't use a row for one variable and not for another.