I am trying to replicate the Stata code (available here) in R. The idea is to randomly select the bootstrap samples at the idcode level, not the idcode-year level, then estimate the parameters on each selected sample and save them in a Stata data file. The cluster and idcluster options in the Stata code force Stata to draw the bootstrap samples from the idcodes, not the idcode-years.
The R code below does the sampling part, per @PBulls' suggestion. How can I modify the R code so that it also replicates the Stata code, pasted further below?
Here is the R code:
data <- haven::read_dta("http://www.stata-press.com/data/r9/nlswork.dta") |>
  dplyr::filter(!is.na(ttl_exp) & !is.na(hours))

panelboot <- function(df) {
  ids <- unique(df[["idcode"]])
  ## draw whole clusters (idcode) with replacement and give each draw its own new id,
  ## so repeated clusters stay distinct (the analogue of Stata's idcluster() option)
  data.frame(
    idcode = sample(ids, replace = TRUE),
    nidcode = seq_along(ids)
  ) |> dplyr::left_join(df, by = "idcode", relationship = "many-to-many")
}

set.seed(1)
boot1 <- panelboot(data)

## ID 1130 is selected 3 times total, contributing 8 observations each time
fid <- which(boot1[, 1] == boot1[1, 1])
## The resamples of ID 1130 have new IDs 1, 933, 1671
boot1[fid, 2]
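To convince yourself the resampling behaves like Stata's cluster()/idcluster() pair, a quick check along these lines (using the data and boot1 objects created above; the check object and count names are just for illustration) should confirm that every drawn cluster keeps all of its original person-year rows under its new ID:

check <- boot1 |>
  dplyr::count(nidcode, idcode, name = "n_boot") |>
  dplyr::left_join(dplyr::count(data, idcode, name = "n_orig"), by = "idcode")
all(check$n_boot == check$n_orig)  ## should be TRUE: each draw carries the full cluster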
Here is the Stata code:
use "nslwork.dta", clear
*you need below for bootstraping from observation not person-years
tsset, clear
generate long newid = idcode // long is not about the data format but it refers to the variable format
tsset newid year
capture program drop savemargins
program savemargins, rclass
reg ttl_exp hours
end
bootstrap _b, saving(boot_output.dta, replace ) reps(10) cluster(idcode) idcluster(newid) : savemargins
Here is the Stata output:
Bootstrap replications (10): .........10 done
Linear regression Number of obs = 28,467
Replications = 10
Wald chi2(1) = 194.08
Prob > chi2 = 0.0000
R-squared = 0.0118
Adj R-squared = 0.0118
Root MSE = 4.6267
(Replications based on 4,710 clusters in idcode)
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
ttl_exp | coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
hours | .0512427 .0036782 13.93 0.000 .0440335 .0584519
_cons | 4.343879 .1302096 33.36 0.000 4.088673 4.599085
------------------------------------------------------------------------------
From the comments, here's code that performs the cluster resampling and a "pooled OLS":
N_SAMPLES <- 100
df <- haven::read_dta("http://www.stata-press.com/data/r9/nlswork.dta") |>
dplyr::filter(!is.na(ttl_exp) & !is.na(hours))
panelboot <- function(df) {
  ids <- unique(df[["idcode"]])
  ## same cluster resampling as above: sample idcodes with replacement,
  ## assigning a fresh nidcode to each draw
  data.frame(
    idcode = sample(ids, replace = TRUE),
    nidcode = seq_along(ids)
  ) |> dplyr::left_join(df, by = "idcode", relationship = "many-to-many")
}

reg <- function(df, id = "nidcode") {
  ## pooled OLS of ttl_exp on hours; the index is only used for plm's bookkeeping
  coef(plm::plm(ttl_exp ~ hours, data = df, model = "pooling", index = c(id, "year")))
}
set.seed(1)
t0 <- reg(df, id="idcode")
t <- t(vapply(seq_len(N_SAMPLES), \(d) reg(panelboot(df)), numeric(2)))
boot_se <- matrixStats::colSds(t)
cbind(t0, boot_se)
#> Estimate SE
#> (Intercept) 4.343879 0.141799
#> hours 0.051243 0.003935
t(vapply(seq_along(t0), \(i) boot::norm.ci(t0=t0[i], t=t[,i])[2:3], numeric(2)))
#> Normal-based 95% CI
#> (Intercept) 4.044749 4.627206
#> hours 0.043783 0.059056
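The Stata call also stores the replicates on disk via saving(boot_output.dta, replace). If you want the same artifact from R, a minimal sketch along these lines should do; it assumes the t matrix of replicates from above, and the variable names are my own choice, not what Stata would write:

## one row per bootstrap replicate, one column per coefficient
boot_reps <- as.data.frame(t)
names(boot_reps) <- c("b_cons", "b_hours")  ## column order follows coef(): intercept first
haven::write_dta(boot_reps, "boot_output.dta")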
A few pointers to some other questions that came up:
- The number of bootstrap replications, N_SAMPLES, only enters through the vapply call; I've moved this to the top.
- The bootstrap estimates are collected in t (in line with boot's parameter naming). This is an N*P matrix, with N the number of bootstrap samples (rows) and P the number of bootstrapped parameters (columns). I subsequently calculate the standard deviations of each column (matrixStats::colSds). You were previously filling and possibly overwriting this object very differently; in that case you may have to calculate row-wise SDs or something else entirely.
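If you prefer not to add the matrixStats dependency, the same column-wise standard deviations can be computed in base R, which also makes the orientation of t explicit:

dim(t)           ## N_SAMPLES x 2: bootstrap samples in rows, parameters in columns
apply(t, 2, sd)  ## one SD per coefficient, equivalent to matrixStats::colSds(t)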