Loop to create multiple lme models, where each column is a different explanatory variable, and also loop through years (rolling window)

In the outer loop, I create a formula for the fixed effects of my model, looping through columns (each column is an independant variable).

In the inner loop, I select a subset of years on which to run my model, then run the model, then store the results (coefficient, standard error, p value) in a dataframe. Each new iteration of the model is combined to the previous with rbind. The rolling window is 30 years long and advances 1 year at each iteration, giving about 40 different models for each independant variable.

1- Is there a more efficient way to do this? 2- I'm still unsure how to store the final result (before closing the outer loop) 3- Also I get this error when running the lme in the inner loop: Error in terms.formula(formula, data = data) : invalid model formula in ExtractVars

this is my code:

    treeIDs <- factor(1:50)
    years <- 1950:2020

    data <- data.frame(
      year = rep(years, times = length(treeIDs)),    
      treeID = rep(treeIDs, each = length(years)),   # group (for random effect)
      rwi = rnorm(length(treeIDs) * length(years)),  # dependant var
      value1 = rnorm(length(treeIDs) * length(years)),  # independant var
      value2 = rnorm(length(treeIDs) * length(years)),  
      value3 = rnorm(length(treeIDs) * length(years))   
    )

    window_size <- 30
    step_size <- 1

# empty dataframe to store results
     final_dataframe <- data.frame(start_year = integer(),
                               end_year = integer(),
                               estimate = numeric(),
                               se = numeric(),
                               p_value = numeric() 
    )

    ind_variables <- data[c("value1","value2","value3")]

# outer loop
    for (col_name in ind_variables) {
  
     model_formula <- formula(paste("rwi ~", scale(col_name)))

# inner loop
    for (start_year in seq(1950, 2020 - window_size + 1, by = step_size)) {
  
  # Define the end year of the window
    end_year <- start_year + window_size - 1
  
  # Subset data for the current window
    subset_data <- data %>%
      filter(year >= start_year & year <= end_year)
 
  # Fit model
    model <- lme(fixed = model_formula, random = ~1|treeID, data = subset_data,
               na.action = na.exclude, method = "ML")
  
  # Extract info from the regression summary
    estimate <- summary(model)$tTable[, c("Value")]
    se <- summary(model)$tTable[, c("Std.Error")]
    p_value <- summary(model)$tTable[,c("p-value")]  
  
  # Store the results 
    result <- data.frame(start_year = start_year,
                       end_year = end_year,
                       estimate = estimate,
                       se = se,
                       p_value = p_value)
  
  
  # combine result to final_dataframe
    final_dataframe <- rbind(final_dataframe, result)
    }

# how to make a list with the 3 final_dataframes ???
    }

I would like the final output to be a list of dataframes (one dataframe for each independant variable)

Solution

I'll try to answer your questions in reverse order.

The formula() function creates a character string, so it will not recognize the scale function or col_name object. I did the scaling up front when defining the data frame so that we can make a clean formula.
We can use lapply() to get an output as a list. In each inner loop, we can return the data frame for that independent variable. The apply function will then give a result as a list of data frames.
There is usually a better way! But it seems to work well enough. I tried to save a few lines by pulling the whole summary(model)$tTable instead of naming variables individually.

library(dplyr)
library(nlme)

treeIDs <- factor(1:50)
years <- 1950:2020

data <- data.frame(
  year = rep(years, times = length(treeIDs)),    
  treeID = rep(treeIDs, each = length(years)),
  rwi = rnorm(length(treeIDs) * length(years)),
  # Scaling these values up front
  value1 = scale(rnorm(length(treeIDs) * length(years))),
  value2 = scale(rnorm(length(treeIDs) * length(years))),  
  value3 = scale(rnorm(length(treeIDs) * length(years)))   
)

window_size <- 30
step_size <- 1

# Just using the names of the independent variables
ind_variables <- c("value1","value2","value3")

# Use lapply to give a list as an output with one element for each ind_var
results <- lapply(ind_variables, \(col_name) {
  
  # Initialize empty data frame for each independent variable. Note that this is 
  # inside of the lapply function, so each new ind_var gets its own data frame.
  var_df <- data.frame()
  
  # Make formula using character string name of each independent variable
  model_formula <- formula(paste("rwi ~", col_name))
  
  # inner loop
  for (start_year in seq(1950, 2020 - window_size + 1, by = step_size)) {
    
    # Define the end year of the window
    end_year <- start_year + window_size - 1
    
    # Subset data for the current window
    subset_data <- data %>%
      filter(year >= start_year & year <= end_year)
    
    # Fit model
    model <- lme(fixed = model_formula, random = ~1|treeID, data = subset_data,
                 na.action = na.exclude, method = "ML")

    # Pull out the model coefficients into a data frame.
    coefs <- summary(model)$tTable %>% 
      as.data.frame() %>% 
      slice(2) # Note - only keeping coefs for the var, not the intercept
    
    # Remove rownames from coefs
    rownames(coefs) <- NULL
    
    # Put start year, end year, and ind_var coefs together
    row <- cbind(start_year, end_year, coefs)
    
    # Add the results for this year into the df for the ind_var
    var_df <- rbind(var_df, row)
  }
  
  # Return the full DF for each ind_var
  return(var_df)
  
}) %>% 
  
  # Rename the list with the ind_variables names
  setNames(ind_variables)

This gives us a list of 3 named data frames:

str(results)
List of 3
 $ value1:'data.frame': 42 obs. of  7 variables:
  ..$ start_year: num [1:42] 1950 1951 1952 1953 1954 ...
  ..$ end_year  : num [1:42] 1979 1980 1981 1982 1983 ...
  ..$ Value     : num [1:42] 0.0317 0.0323 0.0359 0.0362 0.0388 ...
  ..$ Std.Error : num [1:42] 0.0255 0.0255 0.0258 0.0258 0.0256 ...
  ..$ DF        : num [1:42] 1449 1449 1449 1449 1449 ...
  ..$ t-value   : num [1:42] 1.24 1.27 1.39 1.4 1.51 ...
  ..$ p-value   : num [1:42] 0.215 0.206 0.164 0.161 0.13 ...
 $ value2:'data.frame': 42 obs. of  7 variables:
  ..$ start_year: num [1:42] 1950 1951 1952 1953 1954 ...
  ..$ end_year  : num [1:42] 1979 1980 1981 1982 1983 ...
  ..$ Value     : num [1:42] -0.00182 -0.00811 -0.01492 -0.0133 -0.00845 ...
  ..$ Std.Error : num [1:42] 0.0256 0.0258 0.0258 0.0257 0.0256 ...
  ..$ DF        : num [1:42] 1449 1449 1449 1449 1449 ...
  ..$ t-value   : num [1:42] -0.0713 -0.315 -0.5793 -0.5183 -0.33 ...
  ..$ p-value   : num [1:42] 0.943 0.753 0.563 0.604 0.741 ...
 $ value3:'data.frame': 42 obs. of  7 variables:
  ..$ start_year: num [1:42] 1950 1951 1952 1953 1954 ...
  ..$ end_year  : num [1:42] 1979 1980 1981 1982 1983 ...
  ..$ Value     : num [1:42] -0.04 -0.0427 -0.0392 -0.03 -0.0351 ...
  ..$ Std.Error : num [1:42] 0.0254 0.0257 0.0257 0.0257 0.0256 ...
  ..$ DF        : num [1:42] 1449 1449 1449 1449 1449 ...
  ..$ t-value   : num [1:42] -1.58 -1.66 -1.53 -1.16 -1.37 ...
  ..$ p-value   : num [1:42] 0.1154 0.0963 0.1274 0.2442 0.1702 ...

head(results$value1)
  start_year end_year      Value  Std.Error   DF  t-value    p-value
1       1950     1979 0.03166481 0.02550158 1449 1.241680 0.21455559
2       1951     1980 0.03232630 0.02554860 1449 1.265286 0.20597215
3       1952     1981 0.03589970 0.02579971 1449 1.391477 0.16429448
4       1953     1982 0.03615131 0.02576790 1449 1.402959 0.16084326
5       1954     1983 0.03876633 0.02562128 1449 1.513052 0.13048451
6       1955     1984 0.04888114 0.02555047 1449 1.913121 0.05592947

Note that I removed the coefficients for the intercepts using slice() to keep it to one row per date range. If you want to keep those or do anything else with the models, you could nest the model outputs in a column an extract values from them as needed. The purrr package has lots of good tools for this.