rfor-loopmixed-modelsnlme

Loop to create multiple lme models, where each column is a different explanatory variable, and also loop through years (rolling window)


In the outer loop, I create a formula for the fixed effects of my model, looping through columns (each column is an independant variable).

In the inner loop, I select a subset of years on which to run my model, then run the model, then store the results (coefficient, standard error, p value) in a dataframe. Each new iteration of the model is combined to the previous with rbind. The rolling window is 30 years long and advances 1 year at each iteration, giving about 40 different models for each independant variable.

1- Is there a more efficient way to do this? 2- I'm still unsure how to store the final result (before closing the outer loop) 3- Also I get this error when running the lme in the inner loop: Error in terms.formula(formula, data = data) : invalid model formula in ExtractVars

this is my code:

    treeIDs <- factor(1:50)
    years <- 1950:2020

    data <- data.frame(
      year = rep(years, times = length(treeIDs)),    
      treeID = rep(treeIDs, each = length(years)),   # group (for random effect)
      rwi = rnorm(length(treeIDs) * length(years)),  # dependant var
      value1 = rnorm(length(treeIDs) * length(years)),  # independant var
      value2 = rnorm(length(treeIDs) * length(years)),  
      value3 = rnorm(length(treeIDs) * length(years))   
    )

    window_size <- 30
    step_size <- 1

# empty dataframe to store results
     final_dataframe <- data.frame(start_year = integer(),
                               end_year = integer(),
                               estimate = numeric(),
                               se = numeric(),
                               p_value = numeric() 
    )

    ind_variables <- data[c("value1","value2","value3")]

# outer loop
    for (col_name in ind_variables) {
  
     model_formula <- formula(paste("rwi ~", scale(col_name)))

# inner loop
    for (start_year in seq(1950, 2020 - window_size + 1, by = step_size)) {
  
  # Define the end year of the window
    end_year <- start_year + window_size - 1
  
  # Subset data for the current window
    subset_data <- data %>%
      filter(year >= start_year & year <= end_year)
 
  # Fit model
    model <- lme(fixed = model_formula, random = ~1|treeID, data = subset_data,
               na.action = na.exclude, method = "ML")
  
  # Extract info from the regression summary
    estimate <- summary(model)$tTable[, c("Value")]
    se <- summary(model)$tTable[, c("Std.Error")]
    p_value <- summary(model)$tTable[,c("p-value")]  
  
  # Store the results 
    result <- data.frame(start_year = start_year,
                       end_year = end_year,
                       estimate = estimate,
                       se = se,
                       p_value = p_value)
  
  
  # combine result to final_dataframe
    final_dataframe <- rbind(final_dataframe, result)
    }

# how to make a list with the 3 final_dataframes ???
    }

I would like the final output to be a list of dataframes (one dataframe for each independant variable)


Solution

  • I'll try to answer your questions in reverse order.

    library(dplyr)
    library(nlme)
    
    treeIDs <- factor(1:50)
    years <- 1950:2020
    
    data <- data.frame(
      year = rep(years, times = length(treeIDs)),    
      treeID = rep(treeIDs, each = length(years)),
      rwi = rnorm(length(treeIDs) * length(years)),
      # Scaling these values up front
      value1 = scale(rnorm(length(treeIDs) * length(years))),
      value2 = scale(rnorm(length(treeIDs) * length(years))),  
      value3 = scale(rnorm(length(treeIDs) * length(years)))   
    )
    
    window_size <- 30
    step_size <- 1
    
    # Just using the names of the independent variables
    ind_variables <- c("value1","value2","value3")
    
    # Use lapply to give a list as an output with one element for each ind_var
    results <- lapply(ind_variables, \(col_name) {
      
      # Initialize empty data frame for each independent variable. Note that this is 
      # inside of the lapply function, so each new ind_var gets its own data frame.
      var_df <- data.frame()
      
      # Make formula using character string name of each independent variable
      model_formula <- formula(paste("rwi ~", col_name))
      
      # inner loop
      for (start_year in seq(1950, 2020 - window_size + 1, by = step_size)) {
        
        # Define the end year of the window
        end_year <- start_year + window_size - 1
        
        # Subset data for the current window
        subset_data <- data %>%
          filter(year >= start_year & year <= end_year)
        
        # Fit model
        model <- lme(fixed = model_formula, random = ~1|treeID, data = subset_data,
                     na.action = na.exclude, method = "ML")
    
        # Pull out the model coefficients into a data frame.
        coefs <- summary(model)$tTable %>% 
          as.data.frame() %>% 
          slice(2) # Note - only keeping coefs for the var, not the intercept
        
        # Remove rownames from coefs
        rownames(coefs) <- NULL
        
        # Put start year, end year, and ind_var coefs together
        row <- cbind(start_year, end_year, coefs)
        
        # Add the results for this year into the df for the ind_var
        var_df <- rbind(var_df, row)
      }
      
      # Return the full DF for each ind_var
      return(var_df)
      
    }) %>% 
      
      # Rename the list with the ind_variables names
      setNames(ind_variables)
    

    This gives us a list of 3 named data frames:

    str(results)
    List of 3
     $ value1:'data.frame': 42 obs. of  7 variables:
      ..$ start_year: num [1:42] 1950 1951 1952 1953 1954 ...
      ..$ end_year  : num [1:42] 1979 1980 1981 1982 1983 ...
      ..$ Value     : num [1:42] 0.0317 0.0323 0.0359 0.0362 0.0388 ...
      ..$ Std.Error : num [1:42] 0.0255 0.0255 0.0258 0.0258 0.0256 ...
      ..$ DF        : num [1:42] 1449 1449 1449 1449 1449 ...
      ..$ t-value   : num [1:42] 1.24 1.27 1.39 1.4 1.51 ...
      ..$ p-value   : num [1:42] 0.215 0.206 0.164 0.161 0.13 ...
     $ value2:'data.frame': 42 obs. of  7 variables:
      ..$ start_year: num [1:42] 1950 1951 1952 1953 1954 ...
      ..$ end_year  : num [1:42] 1979 1980 1981 1982 1983 ...
      ..$ Value     : num [1:42] -0.00182 -0.00811 -0.01492 -0.0133 -0.00845 ...
      ..$ Std.Error : num [1:42] 0.0256 0.0258 0.0258 0.0257 0.0256 ...
      ..$ DF        : num [1:42] 1449 1449 1449 1449 1449 ...
      ..$ t-value   : num [1:42] -0.0713 -0.315 -0.5793 -0.5183 -0.33 ...
      ..$ p-value   : num [1:42] 0.943 0.753 0.563 0.604 0.741 ...
     $ value3:'data.frame': 42 obs. of  7 variables:
      ..$ start_year: num [1:42] 1950 1951 1952 1953 1954 ...
      ..$ end_year  : num [1:42] 1979 1980 1981 1982 1983 ...
      ..$ Value     : num [1:42] -0.04 -0.0427 -0.0392 -0.03 -0.0351 ...
      ..$ Std.Error : num [1:42] 0.0254 0.0257 0.0257 0.0257 0.0256 ...
      ..$ DF        : num [1:42] 1449 1449 1449 1449 1449 ...
      ..$ t-value   : num [1:42] -1.58 -1.66 -1.53 -1.16 -1.37 ...
      ..$ p-value   : num [1:42] 0.1154 0.0963 0.1274 0.2442 0.1702 ...
    
    head(results$value1)
      start_year end_year      Value  Std.Error   DF  t-value    p-value
    1       1950     1979 0.03166481 0.02550158 1449 1.241680 0.21455559
    2       1951     1980 0.03232630 0.02554860 1449 1.265286 0.20597215
    3       1952     1981 0.03589970 0.02579971 1449 1.391477 0.16429448
    4       1953     1982 0.03615131 0.02576790 1449 1.402959 0.16084326
    5       1954     1983 0.03876633 0.02562128 1449 1.513052 0.13048451
    6       1955     1984 0.04888114 0.02555047 1449 1.913121 0.05592947
    

    Note that I removed the coefficients for the intercepts using slice() to keep it to one row per date range. If you want to keep those or do anything else with the models, you could nest the model outputs in a column an extract values from them as needed. The purrr package has lots of good tools for this.