Let's say I have this kind of data, with value observations grouped by three levels of size:
df <- data.frame(
  size = c(3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5),
  position = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5),
  value = c(1, 2, 2.5,
            -0.5, 0.5, -0.5, 1,
            1.7, 2, 2.5, 1.6, 1.9)
)
I can draw polynomial regression lines for each size level:
library(tidyverse)
df %>%
  ggplot(aes(x = position, y = value, color = factor(size))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  labs(x = "Index", y = "Value") +
  theme_minimal()
resulting in this plot:
In addition to the three separate lines, I need one line that, respecting the grouping, summarizes the three distributions; i.e., a regression line that shows how the value variable is distributed across the three levels. (How) can that be done?
I'm not sure how well-specified the problem is, but there are a couple of options to consider. One is to use a mixed-effects model, in which a single best-fitting polynomial is estimated from all the points, but the intercept is allowed to vary per group:
library(tidyverse)
library(lme4)
df <- df %>%
  mutate(size = factor(size)) %>%
  group_by(size) %>%
  mutate(x = row_number())
mod <- lmer(value ~ poly(x, 2) + (1|size), df)
newdata <- expand.grid(x = seq(1, 5, 0.1),
                       size = factor("Fixed effect of index"))
newdata$value <- predict(mod, newdata, allow.new.levels = TRUE)
df %>%
  ggplot(aes(x = x, y = value, color = size)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  geom_line(data = newdata, linetype = 2) +
  labs(x = "Index", y = "Value") +
  scale_color_manual(values = c("red3", "blue3", "green4", "black")) +
  theme_minimal()
Conceptually, this means that we are showing the single best-fitting polynomial that could be "slid" up and down the plot so it is the same shape for each group, but at a different height. This plot should illustrate what I mean:
newdata2 <- expand.grid(x = seq(1, 5, 0.1),
                        size = factor(c(3:5, "Fixed effect of index")))
newdata2$value <- predict(mod, newdata2, allow.new.levels = TRUE)
df %>%
  ggplot(aes(x = x, y = value, color = size, linetype = size)) +
  geom_point() +
  geom_line(data = newdata2) +
  labs(x = "Index", y = "Value") +
  scale_color_manual(values = c("red3", "blue3", "green4", "black")) +
  scale_linetype_manual(values = c(1, 1, 1, 2)) +
  theme_minimal()
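To see the "same shape, different height" idea in numbers rather than in a plot, one option is to inspect the fitted model directly. The following is only a sketch: it rebuilds the example data, refits the same kind of random-intercept model using `position` as the predictor, and uses illustrative names (`shared_shape`, `group_offsets`, `gap`) that are not part of the original code. It relies on `fixef()`, `ranef()`, and the `re.form` argument of `predict()` from lme4.

```r
library(lme4)

# Example data from the question, with size stored as a factor
df <- data.frame(
  size = factor(c(3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)),
  position = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5),
  value = c(1, 2, 2.5, -0.5, 0.5, -0.5, 1, 1.7, 2, 2.5, 1.6, 1.9)
)

# One quadratic shared by all groups, plus a random intercept per size
mod <- lmer(value ~ poly(position, 2) + (1 | size), data = df)

shared_shape  <- fixef(mod)       # coefficients of the common polynomial
group_offsets <- ranef(mod)$size  # estimated vertical shift for each group

# Predict the curves for two groups and the population-level curve
pos <- seq(1, 3, 0.5)
grid_3 <- data.frame(position = pos, size = factor("3", levels = levels(df$size)))
grid_4 <- data.frame(position = pos, size = factor("4", levels = levels(df$size)))

curve_3   <- predict(mod, newdata = grid_3)
curve_4   <- predict(mod, newdata = grid_4)
curve_pop <- predict(mod, newdata = grid_3, re.form = NA)  # random effects set to zero

# The difference between any two of these curves does not depend on position
gap <- curve_3 - curve_4
print(shared_shape)
print(group_offsets)
print(gap)
```

If the model behaves as described, every element of `gap` is the same number, which is just the difference between the two groups' intercept offsets. Predicting with `re.form = NA` should give the same dashed "fixed effect" curve that `allow.new.levels = TRUE` produces for an unseen level.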
A simpler approach, of course, is to add a separate layer showing a single regression fitted to all groups at once:
df %>%
  ggplot(aes(x = x, y = value, color = size)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2),
              aes(color = "All groups"), se = FALSE, linetype = 2) +
  labs(x = "Index", y = "Value") +
  scale_color_manual(values = c("red3", "blue3", "green4", "black")) +
  theme_minimal()
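The dashed "All groups" layer amounts to a single quadratic fitted to all twelve points with the grouping ignored. If you need the fitted values themselves rather than just the drawn line, a rough equivalent outside ggplot2 might look like the sketch below. It uses only base R and the original `position` column; `pooled` and `grid` are illustrative names, not part of the code above.

```r
# Example data from the question
df <- data.frame(
  size = c(3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5),
  position = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5),
  value = c(1, 2, 2.5, -0.5, 0.5, -0.5, 1, 1.7, 2, 2.5, 1.6, 1.9)
)

# One quadratic fitted to every observation, ignoring size entirely
pooled <- lm(value ~ poly(position, 2), data = df)

# Evaluate the pooled curve on a fine grid of positions
grid <- data.frame(position = seq(1, 5, 0.1))
grid$value <- predict(pooled, newdata = grid)

print(coef(pooled))
print(head(grid))
```

Because the groups cover different ranges of positions, the right-hand end of this pooled curve is determined only by the larger groups, which is one reason it can differ from the mixed-model line.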
It seems from the comments that this isn't exactly what you're looking for.
If neither of these is what you are after, please add a bit of clarification.
Created on 2023-03-21 with reprex v2.0.2