r ggplot2 geom

How to make the `geom_smooth` line fluctuate as much as the original data


In the `geom_smooth` plot below, the line for 2023 is smoother than the line for 2024, even though the SD of the 2023 amount (20) is larger than that of 2024 (15). How can I fix it?

library(tidyverse)
library(lubridate)  # update() for dates; attached by tidyverse only from 2.0.0

df_2023 <- data.frame(mdate = seq.Date(from = as.Date('2023-1-1'),
                                       to = as.Date('2023-12-31'), by = "1 day"),
                      amount = rnorm(365, mean = 4, sd = 20),
                      myear = '2023')

df_2024 <- data.frame(mdate = seq.Date(from = as.Date('2024-1-1'),
                                       to = as.Date('2024-6-28'), by = "1 day"),
                      amount = rnorm(180, mean = 4, sd = 15),
                      myear = '2024')

plot_data <- rbind(df_2023, df_2024)

# Map both years onto 2024 so they share one x axis
plot_data %>% mutate(mdate_new = update(mdate, year = 2024)) %>%
  ggplot(aes(x = mdate_new, y = amount, color = myear)) +
  geom_line(alpha = 0.6) +  # constant alpha belongs outside aes()
  geom_smooth(se = FALSE)

Maybe the 2023 smooth line is generated from the whole year of data and is therefore smoother. I changed the `geom_smooth` call above to the code below, but it failed:

geom_smooth(aes(data= plot_data %>% filter(mdate_new <as.Date('2024-6-28'))))


Solution

  • The message printed by `stat_smooth` tells you that `loess` is used, and that it is used with its default parameters, specifically `span = 0.75`. The documentation explains that each local fit uses a neighbourhood containing a proportion of the data points, and that this proportion is set by `span`. By default, the neighbourhood contains 75 % of the points.

    Your two data subsets have very different numbers of data points (365 for 2023 versus 180 for 2024), so the default neighbourhood contains very different numbers of points (about 274 versus 135), which results in different amounts of smoothing. You can correct that by scaling the span:

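    # mdate_new was only created inside the question's pipe, so add it to plot_data first
    plot_data <- plot_data %>% mutate(mdate_new = update(mdate, year = 2024))

    n24 <- nrow(subset(plot_data, myear == "2024"))
    n23 <- nrow(subset(plot_data, myear == "2023"))

    ggplot(plot_data, aes(x = mdate_new, y = amount, color = myear)) +
      geom_line(alpha = 0.6) +
      # shrink the 2023 span so both fits use the same number of points per neighbourhood
      geom_smooth(data = subset(plot_data, myear == "2023"), se = FALSE, span = 0.75 * n24 / n23) +
      geom_smooth(data = subset(plot_data, myear == "2024"), se = FALSE, span = 0.75)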
    

    I don't show the output here because you didn't set a random seed and thus your data isn't fully reproducible.
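
    As a quick check of the arithmetic behind the rescaled span, here is a minimal sketch in base R. It assumes the row counts from the question (365 days for 2023, 180 for 2024). The names `span23`, `span24`, `default_nbhd` and `scaled_nbhd` are made up for this illustration, and the neighbourhood sizes are approximate.

    ```r
    n23 <- 365  # rows for 2023 in the question's data
    n24 <- 180  # rows for 2024 (2024-01-01 to 2024-06-28)

    # Approximate points per local fit with the default span in both groups
    default_nbhd <- c(`2023` = 0.75 * n23, `2024` = 0.75 * n24)

    # Spans used in the corrected plot
    span24 <- 0.75
    span23 <- 0.75 * n24 / n23

    # Approximate points per local fit after rescaling the 2023 span
    scaled_nbhd <- c(`2023` = span23 * n23, `2024` = span24 * n24)

    print(round(default_nbhd))  # roughly 274 vs 135
    print(round(scaled_nbhd))   # 135 in both years
    print(round(span23, 2))     # about 0.37
    ```

    With the default span, each 2023 fit averages over roughly twice as many days as each 2024 fit, which is why the 2023 curve looks flatter despite the noisier data.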

    PS: It might be preferable to fit a `mgcv::gam` model outside ggplot2. That gives you much finer control over the smoothing.
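
    A rough sketch of what fitting the smooth outside ggplot2 could look like, not a definitive recipe. It rebuilds the question's data with a seed (an addition made here for reproducibility), uses a hypothetical numeric day-of-year column `doy` as the x variable, and picks `k = 20` and `method = "REML"` purely for illustration.

    ```r
    library(mgcv)  # recommended package that ships with R

    set.seed(1)  # assumption: seed added so this sketch is reproducible

    # Rebuild the question's data
    d <- rbind(
      data.frame(mdate = seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "1 day"),
                 amount = rnorm(365, mean = 4, sd = 20), myear = "2023"),
      data.frame(mdate = seq(as.Date("2024-01-01"), as.Date("2024-06-28"), by = "1 day"),
                 amount = rnorm(180, mean = 4, sd = 15), myear = "2024")
    )
    d$myear <- factor(d$myear)                    # `by =` smooths need a factor
    d$doy   <- as.numeric(format(d$mdate, "%j"))  # day of year, 1 to 366

    # One smooth per year; k = 20 is an arbitrary illustrative basis size
    fit <- gam(amount ~ myear + s(doy, by = myear, k = 20),
               data = d, method = "REML")

    # Predict only over the days each year actually covers
    pred <- unique(d[, c("doy", "myear")])
    pred$fit <- as.numeric(predict(fit, newdata = pred))
    ```

    The `pred` data frame can then be drawn on top of the raw series, for example with `geom_line(data = pred, aes(x = doy, y = fit, color = myear))`. Here each year gets its own smoothing parameter estimated from the data, and `k` (or a fixed `sp`) can be adjusted per smooth if you want to control the wiggliness by hand.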