rlinear-regression

Backward Stepwise Regression for Fixed Effects in R


I'm trying to run a backward stepwise regression in R, but I might not know how to accomplish what I want.

This is an example of what I want to do: Using the Iris dataset, I have a simple regression with fixed effects for Species:

lm(Petal.Length ~ Petal.Width + Sepal.Width + Sepal.Length + as.factor(Species), data = iris)

I want to run something similar to a backward stepwise regression, but instead of removing variables, I want to remove categories (subsampling) from Species until my variables stop being significant.

How could I do that?


Solution

  • I achieved the results I wanted with a couple of loops, but I hoped for a more efficient or simpler solution:

    # Install and load necessary packages
    pacman::p_load(dplyr, broom, writexl, beepr)
    
    # Assuming iris is the dataframe 
    # Model: lm(Petal.Length ~ Petal.Width + Sepal.Width + Sepal.Length + as.factor(Species), data = iris)
    
    # Define a function to run the regression model and return the results
    run_model <- function(data, formula) {
      model <- lm(formula, data = data)
      # Extract model summary
      summary_model <- summary(model)
      # Extract coefficients, p-values, and other details
      coefficients <- summary_model$coefficients
      # Return a list with AIC and coefficients
      return(list(coefficients = coefficients))
    }
    
    # Store initial model results for the full model
    full_model_results <- run_model(iris, 
                                    Petal.Length ~ Petal.Width + Sepal.Width + Sepal.Length + as.factor(Species))
    
    # Initialize a list to store results
    results <- list(full_model = full_model_results)
    
    # Get the unique levels of Species
    species <- unique(iris$Species)
    
    # Loop to drop one cat at a time
    for (i in 0:(length(species)-1)) {
      # Subset data to exclude the current set of species
      subset_data <- dplyr::filter(iris, !Species %in% species[1:i])
      
      # Check if there are at least two levels of Species left
      if (length(unique(subset_data$Species)) < 2) {
        next
      }
      
      # Run the regression model on the subset data
      current_results <- run_model(subset_data, 
                                   Petal.Length ~ Petal.Width + Sepal.Width + Sepal.Length + as.factor(Species))
      
      # Store the results with a descriptive name
      results[[paste0("drop_", i, "_species")]] <- current_results
    }
    
    # Format and display the results in a more readable format
    formatted_results <- lapply(results, function(res) {
      list(
        Coefficients = as.data.frame(res$coefficients)
      )
    })