r ggplot2 non-linear-regression

Draw polynomial regression line that summarizes over grouped distributions


Let's say I have this kind of data, with value observations grouped by three levels of size:

df <- data.frame(
  size = c(3,3,3,4,4,4,4,5,5,5,5,5),
  position = c(1,2,3,1,2,3,4,1,2,3,4,5),
  value = c(1,2,2.5,
            -0.5,0.5,-0.5,1,
            1.7,2,2.5,1.6,1.9)
)

I can draw polynomial regression lines for each size level:

library(tidyverse)
df %>%
  ggplot(aes(x = position, y = value, color = factor(size))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(x = "Index", y = "Value") +
  theme_minimal()

resulting in this plot:

[Plot: the data points with one quadratic regression line per size level, colored by size]

What I need in addition to the three separate lines is one line that, respecting the grouping, summarizes the three distributions; i.e., a regression line that shows how the value variable distributes across the three levels. (How) can that be done?


Solution

  • I'm not sure how well-specified the problem is, but there are a couple of options to consider. One is to use a mixed-effects model, in which a single best-fitting polynomial is estimated from all the points while the intercept is allowed to vary by group:

    library(tidyverse)
    library(lme4)
    
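    # Treat size as a grouping factor and index the observations within each group
    df <- df %>% 
      mutate(size = factor(size)) %>%
      group_by(size) %>% 
      mutate(x = row_number()) %>%
      ungroup()
    
    # One quadratic shared by all groups, with a random intercept per size level
    mod <- lmer(value ~ poly(x, 2) + (1|size), df)
    # A level the model has not seen makes predict() fall back to the fixed effects only
    newdata <- expand.grid(x = seq(1, 5, 0.1), 
                           size = factor("Fixed effect of index"))
    newdata$value <- predict(mod, newdata, allow.new.levels = TRUE)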
    
    df %>%
      ggplot(aes(x = x, y = value, color = size)) +
      geom_point() +
      geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
      geom_line(data = newdata, linetype = 2) + 
      labs(x = "Index", y = "Value") +
      scale_color_manual(values = c("red3", "blue3", "green4", "black")) +
      theme_minimal()
    

    [Plot: the three per-group quadratic fits plus a dashed black line for the fixed-effect polynomial]

    Conceptually, this means that we are showing the single best-fitting polynomial that could be "slid" up and down the plot so it is the same shape for each group, but at a different height. This plot should illustrate what I mean:

    newdata2 <- expand.grid(x = seq(1, 5, 0.1), 
                            size = factor(c(3:5, "Fixed effect of index")))
    newdata2$value <- predict(mod, newdata2, allow.new.levels = TRUE)
    
    df %>%
      ggplot(aes(x = x, y = value, color = size, linetype = size)) +
      geom_point() +
      geom_line(data = newdata2) + 
      labs(x = "Index", y = "Value") +
      scale_color_manual(values = c("red3", "blue3", "green4", "black")) +
      scale_linetype_manual(values = c(1, 1, 1, 2)) +
      theme_minimal()
    

    [Plot: the same quadratic curve drawn at a different height for each group, with the dashed fixed-effect curve]
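    The "sliding" idea can also be checked numerically. The following is a minimal sketch, assuming the same data and model as above; the names `grid`, `fixed_curve`, `group_curve` and `offset` are made up for illustration. It uses `re.form = NA` in `predict()` to switch the random effects off, as an alternative to the dummy factor level, and compares the result with the curve for one existing group:

    ```r
    library(lme4)

    # Same data as in the question, with size already a factor
    df <- data.frame(
      size  = factor(c(3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)),
      x     = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5),
      value = c(1, 2, 2.5, -0.5, 0.5, -0.5, 1, 1.7, 2, 2.5, 1.6, 1.9)
    )

    mod <- lmer(value ~ poly(x, 2) + (1 | size), data = df)

    # Prediction grid for one existing group (size 4)
    grid <- expand.grid(x = seq(1, 5, 0.5),
                        size = factor("4", levels = levels(df$size)))

    # Population-level curve: random effects ignored
    fixed_curve <- predict(mod, newdata = grid, re.form = NA)

    # Curve for size 4: fixed effects plus that group's random intercept
    group_curve <- predict(mod, newdata = grid)

    # The gap between the two curves is the same at every x ...
    offset <- group_curve - fixed_curve
    range(offset)

    # ... and equals the estimated random intercept for size 4
    ranef(mod)$size["4", "(Intercept)"]
    ```

    If the two printed quantities agree, the group curve really is the fixed-effect curve shifted vertically by a constant, which is all a random-intercept model allows.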

    A simpler approach, of course, is to add a separate layer showing a regression on all groups at once:

    df %>%
      ggplot(aes(x = x, y = value, color = size)) +
      geom_point() +
      geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) + 
      geom_smooth(method = "lm", formula = y ~ poly(x, 2), 
                  aes(color = "All groups"), se = FALSE, linetype = 2) + 
      labs(x = "Index", y = "Value") +
      scale_color_manual(values = c("red3", "blue3", "green4", "black")) +
      theme_minimal()
    

    [Plot: the three per-group quadratic fits plus a dashed black "All groups" fit]
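    For reference, the "All groups" layer corresponds to an ordinary pooled `lm()` fit that ignores `size` entirely. A minimal sketch, assuming the same data as above (the names `pooled`, `raw` and `grid` are made up for illustration), in case the fitted curve or its coefficients are needed outside the plot:

    ```r
    # Same data as in the question; x is the index within each size group
    df <- data.frame(
      size  = factor(c(3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)),
      x     = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5),
      value = c(1, 2, 2.5, -0.5, 0.5, -0.5, 1, 1.7, 2, 2.5, 1.6, 1.9)
    )

    # Pooled quadratic fit over all observations, grouping ignored
    pooled <- lm(value ~ poly(x, 2), data = df)

    # Fitted curve on a fine grid, e.g. for geom_line(data = grid)
    grid <- data.frame(x = seq(1, 5, 0.1))
    grid$value <- predict(pooled, newdata = grid)

    # poly() uses orthogonal polynomials, so its coefficients are not the
    # usual a + b*x + c*x^2 terms; a raw parametrization gives the same fit
    raw <- lm(value ~ x + I(x^2), data = df)
    coef(raw)
    ```

    Unlike the mixed model, this fit treats all twelve points as one sample, so groups with more observations pull the curve more strongly.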

    It seems from the comments that this isn't exactly what you're looking for.

    If neither of these is what you are after, please add a bit of clarification.

    Created on 2023-03-21 with reprex v2.0.2