I am quite new to R, so please bear with me. I am plotting daily average temperature across a four-month period, and I want to visualize how much confidence I have in the data. I believe the best way to do this is to draw confidence intervals on the graph. Basically, I just want to show the spread of the data in a format similar to this one I saw (PLOT 2).
I used code from a prior Stack Overflow question similar to mine. I don't fully understand what the code is doing, so I have very likely gone wrong somewhere and just need some general direction on how to do this. Here is the code I am using:
library(ggplot2)
library(stats)
library(dplyr)
(ggplot(DailyAvgL5, aes(Date, mean_temp)) +
  stat_summary(
    geom = 'smooth',
    fun.data = mean_cl_normal,
    fun.args = list(conf.int = 0.95),
    group = 1,
    alpha = 0.5,
    color = 'black',
    se = TRUE)
)
The resulting graph is just the line graph, showing no visualization of spread.
Here is the data I am using for this graph:
dput(head(DailyAvgL5))
structure(list(Date = structure(c(19791, 19792, 19793, 19794,
19795, 19796), class = "Date"), mean_temp = c(9.98765502929687,
9.884833984375, 8.01781209309896, 8.70198394775391, 9.21991678873698,
9.69807739257812), z_scoremean_temp = c(-1.34020216363965, -1.36818008165322,
-1.87620239793434, -1.69003716229342, -1.54910606620731, -1.41899711641255
), overallmean = c(14.9037835315265, 14.9037835315265, 14.9037835315265,
14.9037835315265, 14.9037835315265, 14.9037835315265)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
I think confidence intervals are what I want to use. I initially tried to visualize the z-scores but couldn't figure out a way to do that either. I'm not sure what the best direction would be for visualizing the spread / confidence in this data. Thanks for the help!
The example plot you are showing summarises one of the columns of the built-in iris data set, but this has a fundamentally different structure from your own data, which means you cannot apply the same code to get a similar output.
The example plot summarizes 50 sepal length measurements for each species of iris on the x axis. If we have several measurements at each point on our x axis, we can calculate the mean and the 95% confidence interval for the mean directly at each point on the x axis using stat_summary.
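As a rough illustration of the per-group calculation described above, here is a minimal base-R sketch that computes the mean and a t-based 95% confidence interval of sepal length for each species. The name ci_by_species is made up for this example, and the code only approximates by hand what the summary step does; it is not the plotting code itself.

```r
# Mean and t-based 95% confidence interval of Sepal.Length for each species.
# `ci_by_species` is a hypothetical name used only in this sketch.
groups <- split(iris$Sepal.Length, iris$Species)

ci_by_species <- do.call(rbind, lapply(groups, function(x) {
  n <- length(x)
  # Half-width of the interval: t quantile times the standard error of the mean
  half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  data.frame(
    n     = n,
    mean  = mean(x),
    lower = mean(x) - half_width,
    upper = mean(x) + half_width
  )
}))

print(ci_by_species)
```

The interval exists here only because each species contributes 50 values, so a standard deviation can be estimated within every group.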
However, you have a single temperature measurement for each date on the x axis, and we can't generate a confidence interval from a single measurement.
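A quick way to see the problem, sketched here with the first mean_temp value from your data (the variable names are just for illustration): the standard deviation of a single value is undefined, so there is nothing to build an interval from.

```r
# One temperature reading for one date (first mean_temp value in the question)
one_reading <- 9.98765502929687

# sd() needs at least two values; with one value it returns NA
spread <- sd(one_reading)
print(spread)
```

With no spread at a given date, any interval computed per date comes out missing, which would explain why only the line is drawn.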
Instead, we need to fit a regression that estimates a smoothed trend (effectively a moving average) of the data across the x axis. This will be a trendline around which the actual temperatures vary. The regression also gives us a continuous confidence interval for this trend, which is probably the closest we can get to your stated goal. Note that the band describes uncertainty in the trend itself, not the spread of individual daily readings.
The easiest way to do this in ggplot is to use geom_smooth. With fewer than 1,000 observations, its default settings fit a local polynomial regression (loess).
You have only provided 6 data points, which is not quite enough data to generate a local polynomial regression, so I have added a few plausible rows of data to demonstrate (see the bottom of the answer for the data used).
The basic call would be:
library(ggplot2)
ggplot(DailyAvgL5, aes(Date, mean_temp)) +
  geom_smooth()
This shows the moving average in blue with the 95% confidence interval for the moving average in the gray band. If you only want to see your raw data and the gray band behind it, you can instead do:
ggplot(DailyAvgL5, aes(Date, mean_temp)) +
  geom_smooth(linetype = 0) +
  geom_line()
A more typical way to present this data would be with the raw data as points and the regression line and its confidence interval included behind the data:
ggplot(DailyAvgL5, aes(Date, mean_temp)) +
  geom_smooth() +
  geom_point()
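If it helps to see where the gray band comes from, here is a minimal base-R sketch that fits a loess curve and builds a 95% band from its standard errors. It is an approximation written for illustration, not a copy of ggplot's internals. The data frame temps, the day index, and the column names in band are assumptions made for this example, and the temperatures are the mean_temp values from the data used in this answer, rounded to two decimals.

```r
# Hypothetical stand-in for the plotted data: day index instead of Date,
# mean_temp values rounded to two decimals
temps <- data.frame(
  day = 1:26,
  mean_temp = c(9.99, 9.88, 8.02, 8.70, 9.22, 9.70, 9.25, 10.28, 10.56,
                10.11, 11.15, 10.73, 11.21, 11.05, 11.38, 11.51, 10.74,
                11.77, 11.02, 11.58, 11.82, 11.53, 13.22, 11.24, 12.34,
                12.69)
)

# Local polynomial regression with loess() defaults (span 0.75, degree 2)
fit <- loess(mean_temp ~ day, data = temps)

# Fitted trend plus its standard error at each observed day
pred <- predict(fit, newdata = temps, se = TRUE)

# 95% band for the trend: fitted value +/- t quantile * standard error
half_width <- qt(0.975, df = pred$df) * pred$se.fit
band <- data.frame(
  day   = temps$day,
  trend = pred$fit,
  lower = pred$fit - half_width,
  upper = pred$fit + half_width
)

print(head(band))
```

The lower and upper columns play the role of the gray ribbon and trend plays the role of the blue line, so the band narrows or widens with how well the local fit is pinned down by nearby points.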
Data used
DailyAvgL5 <- structure(list(Date = structure(c(19791, 19792, 19793, 19794,
19795, 19796, 19797, 19798, 19799, 19800, 19801, 19802, 19803,
19804, 19805, 19806, 19807, 19808, 19809, 19810, 19811, 19812,
19813, 19814, 19815, 19816), class = "Date"), mean_temp = c(9.98765502929687,
9.884833984375, 8.01781209309896, 8.70198394775391, 9.21991678873698,
9.69807739257812, 9.25, 10.28, 10.56, 10.11, 11.15, 10.73, 11.21,
11.05, 11.38, 11.51, 10.74, 11.77, 11.02, 11.58, 11.82, 11.53,
13.22, 11.24, 12.34, 12.69), z_scoremean_temp = c(-1.34020216363965,
-1.36818008165322, -1.87620239793434, -1.69003716229342, -1.54910606620731,
-1.41899711641255, -1.44, -1.4, -1.36, -1.32, -1.27, -1.23, -1.19,
-1.15, -1.11, -1.07, -1.03, -0.99, -0.95, -0.91, -0.87, -0.83,
-0.79, -0.75, -0.71, -0.67), overallmean = c(14.9037835315265,
14.9037835315265, 14.9037835315265, 14.9037835315265, 14.9037835315265,
14.9037835315265, 14.9, 14.9, 14.9, 14.9, 14.9, 14.9, 14.9, 14.9,
14.9, 14.9, 14.9, 14.9, 14.9, 14.9, 14.9, 14.9, 14.9, 14.9, 14.9,
14.9)), row.names = c(NA, -26L), class = c("tbl_df", "tbl", "data.frame"
))