In the outer loop, I create a formula for the fixed effects of my model, looping through columns (each column is an independant variable).
In the inner loop, I select a subset of years on which to run my model, then run the model, then store the results (coefficient, standard error, p value) in a dataframe. Each new iteration of the model is combined to the previous with rbind. The rolling window is 30 years long and advances 1 year at each iteration, giving about 40 different models for each independant variable.
1- Is there a more efficient way to do this? 2- I'm still unsure how to store the final result (before closing the outer loop) 3- Also I get this error when running the lme in the inner loop: Error in terms.formula(formula, data = data) : invalid model formula in ExtractVars
this is my code:
treeIDs <- factor(1:50)
years <- 1950:2020
data <- data.frame(
year = rep(years, times = length(treeIDs)),
treeID = rep(treeIDs, each = length(years)), # group (for random effect)
rwi = rnorm(length(treeIDs) * length(years)), # dependant var
value1 = rnorm(length(treeIDs) * length(years)), # independant var
value2 = rnorm(length(treeIDs) * length(years)),
value3 = rnorm(length(treeIDs) * length(years))
)
window_size <- 30
step_size <- 1
# empty dataframe to store results
final_dataframe <- data.frame(start_year = integer(),
end_year = integer(),
estimate = numeric(),
se = numeric(),
p_value = numeric()
)
ind_variables <- data[c("value1","value2","value3")]
# outer loop
for (col_name in ind_variables) {
model_formula <- formula(paste("rwi ~", scale(col_name)))
# inner loop
for (start_year in seq(1950, 2020 - window_size + 1, by = step_size)) {
# Define the end year of the window
end_year <- start_year + window_size - 1
# Subset data for the current window
subset_data <- data %>%
filter(year >= start_year & year <= end_year)
# Fit model
model <- lme(fixed = model_formula, random = ~1|treeID, data = subset_data,
na.action = na.exclude, method = "ML")
# Extract info from the regression summary
estimate <- summary(model)$tTable[, c("Value")]
se <- summary(model)$tTable[, c("Std.Error")]
p_value <- summary(model)$tTable[,c("p-value")]
# Store the results
result <- data.frame(start_year = start_year,
end_year = end_year,
estimate = estimate,
se = se,
p_value = p_value)
# combine result to final_dataframe
final_dataframe <- rbind(final_dataframe, result)
}
# how to make a list with the 3 final_dataframes ???
}
I would like the final output to be a list of dataframes (one dataframe for each independant variable)
I'll try to answer your questions in reverse order.
formula()
function creates a character string, so it will not recognize the scale
function or col_name
object. I did the scaling up front when defining the data frame so that we can make a clean formula.lapply()
to get an output as a list. In each inner loop, we can return the data frame for that independent variable. The apply function will then give a result as a list of data frames.summary(model)$tTable
instead of naming variables individually.library(dplyr)
library(nlme)
treeIDs <- factor(1:50)
years <- 1950:2020
data <- data.frame(
year = rep(years, times = length(treeIDs)),
treeID = rep(treeIDs, each = length(years)),
rwi = rnorm(length(treeIDs) * length(years)),
# Scaling these values up front
value1 = scale(rnorm(length(treeIDs) * length(years))),
value2 = scale(rnorm(length(treeIDs) * length(years))),
value3 = scale(rnorm(length(treeIDs) * length(years)))
)
window_size <- 30
step_size <- 1
# Just using the names of the independent variables
ind_variables <- c("value1","value2","value3")
# Use lapply to give a list as an output with one element for each ind_var
results <- lapply(ind_variables, \(col_name) {
# Initialize empty data frame for each independent variable. Note that this is
# inside of the lapply function, so each new ind_var gets its own data frame.
var_df <- data.frame()
# Make formula using character string name of each independent variable
model_formula <- formula(paste("rwi ~", col_name))
# inner loop
for (start_year in seq(1950, 2020 - window_size + 1, by = step_size)) {
# Define the end year of the window
end_year <- start_year + window_size - 1
# Subset data for the current window
subset_data <- data %>%
filter(year >= start_year & year <= end_year)
# Fit model
model <- lme(fixed = model_formula, random = ~1|treeID, data = subset_data,
na.action = na.exclude, method = "ML")
# Pull out the model coefficients into a data frame.
coefs <- summary(model)$tTable %>%
as.data.frame() %>%
slice(2) # Note - only keeping coefs for the var, not the intercept
# Remove rownames from coefs
rownames(coefs) <- NULL
# Put start year, end year, and ind_var coefs together
row <- cbind(start_year, end_year, coefs)
# Add the results for this year into the df for the ind_var
var_df <- rbind(var_df, row)
}
# Return the full DF for each ind_var
return(var_df)
}) %>%
# Rename the list with the ind_variables names
setNames(ind_variables)
This gives us a list of 3 named data frames:
str(results)
List of 3
$ value1:'data.frame': 42 obs. of 7 variables:
..$ start_year: num [1:42] 1950 1951 1952 1953 1954 ...
..$ end_year : num [1:42] 1979 1980 1981 1982 1983 ...
..$ Value : num [1:42] 0.0317 0.0323 0.0359 0.0362 0.0388 ...
..$ Std.Error : num [1:42] 0.0255 0.0255 0.0258 0.0258 0.0256 ...
..$ DF : num [1:42] 1449 1449 1449 1449 1449 ...
..$ t-value : num [1:42] 1.24 1.27 1.39 1.4 1.51 ...
..$ p-value : num [1:42] 0.215 0.206 0.164 0.161 0.13 ...
$ value2:'data.frame': 42 obs. of 7 variables:
..$ start_year: num [1:42] 1950 1951 1952 1953 1954 ...
..$ end_year : num [1:42] 1979 1980 1981 1982 1983 ...
..$ Value : num [1:42] -0.00182 -0.00811 -0.01492 -0.0133 -0.00845 ...
..$ Std.Error : num [1:42] 0.0256 0.0258 0.0258 0.0257 0.0256 ...
..$ DF : num [1:42] 1449 1449 1449 1449 1449 ...
..$ t-value : num [1:42] -0.0713 -0.315 -0.5793 -0.5183 -0.33 ...
..$ p-value : num [1:42] 0.943 0.753 0.563 0.604 0.741 ...
$ value3:'data.frame': 42 obs. of 7 variables:
..$ start_year: num [1:42] 1950 1951 1952 1953 1954 ...
..$ end_year : num [1:42] 1979 1980 1981 1982 1983 ...
..$ Value : num [1:42] -0.04 -0.0427 -0.0392 -0.03 -0.0351 ...
..$ Std.Error : num [1:42] 0.0254 0.0257 0.0257 0.0257 0.0256 ...
..$ DF : num [1:42] 1449 1449 1449 1449 1449 ...
..$ t-value : num [1:42] -1.58 -1.66 -1.53 -1.16 -1.37 ...
..$ p-value : num [1:42] 0.1154 0.0963 0.1274 0.2442 0.1702 ...
head(results$value1)
start_year end_year Value Std.Error DF t-value p-value
1 1950 1979 0.03166481 0.02550158 1449 1.241680 0.21455559
2 1951 1980 0.03232630 0.02554860 1449 1.265286 0.20597215
3 1952 1981 0.03589970 0.02579971 1449 1.391477 0.16429448
4 1953 1982 0.03615131 0.02576790 1449 1.402959 0.16084326
5 1954 1983 0.03876633 0.02562128 1449 1.513052 0.13048451
6 1955 1984 0.04888114 0.02555047 1449 1.913121 0.05592947
Note that I removed the coefficients for the intercepts using slice()
to keep it to one row per date range. If you want to keep those or do anything else with the models, you could nest the model outputs in a column an extract values from them as needed. The purrr
package has lots of good tools for this.