r ggplot2 linear-regression dummy-variable

ggplot geom_smooth() for linear regression dummy variable - no regression lines


I apologize in advance for my first question on this platform. I have browsed many threads, but most of what I found does not involve dummy variables. After around 4 hours of investigating the issue, I still cannot figure out what's wrong.

For teaching and demonstration purposes, I am currently trying to set up a little plot of a dummy-variable linear model using the well-known iris dataset.

However, no matter what I try, I cannot get it to plot more than one regression line at once, whereas I would expect two: one from setosa to versicolor and one from setosa to virginica, with setosa being the first (reference) level of the factor.

Here's my code and what I have tried so far.

#Loading the tidyverse.

library(tidyverse)

#Loading the iris dataset and saving it as an object.

dataset <- as_tibble(iris)

#As a background: I'd like to visualize the following linear regression.

ols_lm_iris_all <- lm(Petal.Length ~ Species,
                      data = dataset)

summary(ols_lm_iris_all)

#This results in the following lm-model, which gives pairwise comparisons for: versicolor to setosa and virginica to setosa. So far so good.

#My code to visualize the data would be as follows.

iris_lm_plot_all <- ggplot(dataset, aes(x = Species, y = Petal.Length, colour = Species)) +
                    geom_smooth(method = lm, aes(group = Species)) +
                    geom_jitter(width = 0.2) +
                    labs(title = "Linear OLS regression with two regression lines for three types of Iris.")

iris_lm_plot_all

#What I get in the end, however, is a scatter plot with jitter but without any regression lines at all.

#Here's what I've also tried, partly successfully. If we assign "aes(group = 1)" instead of group = Species, we get one lm-line from the last factor virginica to setosa. That is half the work, but now we don't get versicolor to setosa.

iris_lm_plot_all <- ggplot(dataset, aes(x = Species, y = Petal.Length, colour = Species)) +
                    geom_smooth(method = lm, aes(group = 1)) +
                    geom_jitter(width = 0.2) +
                    labs(title = "Linear OLS regression with two regression lines for three types of Iris.")

What I also thought about: Could it be that geom_smooth does not handle the dummification of Species properly?


Solution

  • Your last plot (the one with group = 1) isn't doing quite what you'd hoped. Species is being converted to a numeric (its integer level codes 1, 2, 3), and then a single lm is fitted across all three factor levels. It just happens in this example that the line passes close to the first and last clusters, but if the factor levels are re-ordered that no longer happens.

    You can manually edit the data for two comparisons/three factor levels, but that doesn't easily generalise:

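    library(tidyverse)
    
    dataset <- as_tibble(iris)
    
    # Stack two subsets, one per pairwise comparison against setosa (level 1):
    # "A" = setosa + versicolor, "B" = setosa + virginica.
    # Note that the setosa rows therefore appear twice in the stacked data.
    dataset <- rbind(
      cbind(subset(dataset, as.numeric(Species) %in% c(1, 2)), comparison = "A"),
      cbind(subset(dataset, as.numeric(Species) %in% c(1, 3)), comparison = "B")
    )
    
    # Grouping by comparison gives each lm fit two distinct x positions.
    ggplot(dataset, aes(x = Species, y = Petal.Length, colour = Species)) +
      geom_smooth(method = lm, aes(group = comparison)) +
      geom_jitter(width = 0.2) +
      labs(title = "Linear OLS regression with two regression lines for three types of Iris.")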
    

    Output plot

    This question has other approaches to plotting pairwise comparisons.
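
As a rough sketch of one way this could generalise to any number of levels (not necessarily what the linked approaches do): rather than letting geom_smooth() refit on integer positions, the lines can be drawn directly from the coefficients of the dummy-coded lm() in the question, with one segment from the reference-level mean to each other level's mean. The names `fit_numeric` and `segments` below are illustrative helpers made up for this example, not part of ggplot2, and the plot title is a placeholder.

```r
library(ggplot2)

# Dummy-coded model from the question: the intercept is the mean of the
# reference level (setosa); every other coefficient is the difference
# between one species and the reference level.
fit <- lm(Petal.Length ~ Species, data = iris)
coefs <- coef(fit)

# For comparison, roughly what geom_smooth(aes(group = 1)) fits: Species sits
# at the integer positions 1, 2, 3 and acts as a single numeric predictor.
fit_numeric <- lm(Petal.Length ~ as.numeric(Species), data = iris)

# One row per non-reference level: a segment from the reference mean to that
# level's mean. `segments` is a hypothetical helper data frame.
lv <- levels(iris$Species)
segments <- data.frame(
  x    = factor(lv[1], levels = lv),
  xend = factor(lv[-1], levels = lv),
  y    = unname(coefs[1]),
  yend = unname(coefs[1] + coefs[-1])
)

p <- ggplot(iris, aes(x = Species, y = Petal.Length, colour = Species)) +
  geom_jitter(width = 0.2) +
  geom_segment(
    data = segments,
    aes(x = x, xend = xend, y = y, yend = yend),
    inherit.aes = FALSE  # the segments carry their own aesthetics
  ) +
  labs(title = "Dummy-coded lm: reference level vs. each other level")

p
```

Comparing coef(fit) with coef(fit_numeric) shows the difference between the two fits: the dummy-coded model has one coefficient per comparison, while the numeric fit has a single slope shared across all three positions. Unlike the rbind() approach, this leaves the data untouched, so no points are plotted twice, but it draws plain segments without the confidence bands that geom_smooth() adds.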