rggplot2tidyverselinear-regressionbroom

Extract pValue of slopes from a linear model using the tidyverse and add it to a plot


This relates to a previously asked question about fitting a linear or sloped model of gene expression data. I have expression values of various genes during 5 consecutive development stages and want to create a linear model to test of the genes increase, decrease or remain stable during development. As I have fairly few data points, a linear model would be most appropriate. I fund a way to extract some the data from the linear model but can not get the pValue of the slope that would tell me the chance that the slope is actually horizontal (so no change during development. Below, find example data and what i managed to get.

The data from the model only give me the intercept p-value but not the pvalue of the slope.

As a bonus, it would be great to integrate the slope p-value for each gene in the plot.

# packages
library('tidyverse') # ggplot & dplyr
library('magrittr') # pipe operations
library(broom) # tidy models

## generate data
set.seed(123)
num_genes <- 5
num_groups <- 5
exp <- data.frame()
for (gene_id in 1:num_genes) {
  gene_name <- paste("Gene", gene_id, sep = "_")
  for (group_id in 1:num_groups) {
    group_name <- paste("Stage", group_id, sep = "_")
    expression_values <- rnorm(10, mean = 10, sd = 2)  # Change 10 to your desired sample size
    group_data <- data.frame(Gene = gene_name, Group = group_name, Expression = expression_values)
    exp <- rbind(exp, group_data)
  }
}
exp$Group <- factor(exp$Group, c('Stage_1',  'Stage_2', 'Stage_3', 'Stage_4', 'Stage_5'))
head(exp)


# get data from linear model: missing: pvalue of the slope
dat.lm <- exp %>%
  group_by(Gene) %>%
  group_modify(~ broom::tidy(lm(Expression ~ Group, data = .x)))
head(dat.lm)

# plot
exp %>%
  ggplot(aes(x=Group, y=Expression, color = Gene, group = Gene)) +
  geom_smooth(method = lm, se = T, alpha = 0.1, aes(fill = Gene))

Solution

  • There are a few things to note here. geom_smooth(method = "lm") does not produce the same output as broom::tidy(lm()). The fit of the lm model estimates a coefficient for every Group / Stage. So there really is no slope in here but just coefficients which increase or decrease the expression from the intercept. geom_smooth() treats your factor variable Group like a continuous variable and is able to estimate a slope.

    Therefore, in order to get the results of your model fitting to be in line with geom_smooth() and being able to calculate the slope with the respective p-value we need to convert the factor to numeric first.

    Then we can add the estimates and p-values via tableGrob() from the gridExtra package.

    library(tidyverse) 
    library(broom) 
    library(gridExtra)
    
    ## generate data
    set.seed(123)
    num_genes <- 5
    num_groups <- 5
    exp <- data.frame()
    for (gene_id in 1:num_genes) {
      gene_name <- paste("Gene", gene_id, sep = "_")
      for (group_id in 1:num_groups) {
        group_name <- paste("Stage", group_id, sep = "_")
        expression_values <- rnorm(10, mean = 10, sd = 2)  
        group_data <- data.frame(Gene = gene_name, Group = group_name, Expression = expression_values)
        exp <- rbind(exp, group_data)
      }
    }
    exp$Group <- factor(exp$Group, c('Stage_1',  'Stage_2', 'Stage_3', 'Stage_4', 'Stage_5'))
    
    exp$Group <- as.numeric(exp$Group)
    
    dat.lm <- exp %>%
      group_by(Gene) %>%
      group_modify(~ broom::tidy(lm(Expression ~ Group, data = .x)))
    
    exp %>%
      ggplot(aes(x = Group, y = Expression, color = Gene, group = Gene)) +
      geom_smooth(method = "lm", se = T, alpha = 0.1, aes(fill = Gene)) + 
      annotation_custom(tableGrob(dat.lm[,c(1,2,3,6)],
                                  rows = NULL,
                                  theme = ttheme_minimal(base_size = 6)),
                        xmin = 1.0, xmax = 2, ymin = 10.5, ymax = 11.3)
    

    results in:

    enter image description here