I am trying to better understand why stat_smooth won't plot my polynomial regression line unless my x variable (the independent variable) is first assigned to a value outside of the plot (e.g. x <- dataset$Level).
library(ggplot2)
library(tibble)

dataset <- tibble(Level = 1:10,
                  Salary = c(45000, 50000, 60000, 80000, 110000, 150000, 200000, 300000, 500000, 1000000))

# First attempt: the formula refers to dataset$... directly; no line is drawn
ggplot(data = dataset, aes(x = Level, y = Salary)) +
  geom_point(color = "red") +
  stat_smooth(method = "lm", se = FALSE,
              formula = dataset$Salary ~ poly(dataset$Level, 3)) +
  ggtitle("Truth or Bluff (Linear Regression)") +
  xlab("Level ") +
  ylab("Salary") +
  theme(plot.title = element_text(hjust = 0.5))
x <- dataset$Level
ggplot(data = dataset, aes(x = Level, y = Salary)) +
  geom_point(color = "red") +
  stat_smooth(method = "lm", se = FALSE,
              formula = dataset$Salary ~ poly(x, 3)) +
  ggtitle("Truth or Bluff (Linear Regression)") +
  xlab("Level ") +
  ylab("Salary") +
  theme(plot.title = element_text(hjust = 0.5))
x <- dataset$Level is no different from dataset$Level aside from being stored in a value. My only thought is that it has to do with how poly() views x, a numeric vector, versus how it views dataset$Level, an extracted vector.
Other than that I would expect the same result, but that is not the case.
I also tried renaming x to t, and it does exactly what the first graph did, so I don't understand why x is so significant if it's just the name of the value.
t <- dataset$Level
ggplot(data = dataset, aes(x = Level, y = Salary)) +
  geom_point(color = "red") +
  stat_smooth(method = "lm", se = FALSE,
              formula = dataset$Salary ~ poly(t, 3)) +
  ggtitle("Truth or Bluff (Linear Regression)") +
  xlab("Level ") +
  ylab("Salary") +
  theme(plot.title = element_text(hjust = 0.5))
The formula passed to stat_smooth uses the mapped aesthetics, i.e. x and y (as you have mapped x=Level, y=Salary). If you had mapped colour=SomeVariable, you'd have to use colour rather than SomeVariable as well.
So the call should be:
stat_smooth(..., formula=y ~ poly(x, 3))
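For completeness, here is a minimal sketch of the whole plot with that formula. It assumes ggplot2 is installed and swaps the tibble for a plain data.frame just to keep dependencies down; everything else is taken from the question.

```r
library(ggplot2)

# Same data as in the question; data.frame is used here instead of tibble (an assumption)
dataset <- data.frame(
  Level  = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000, 150000, 200000, 300000, 500000, 1000000)
)

# The formula is written in terms of the aesthetics x and y, not the column names
p <- ggplot(data = dataset, aes(x = Level, y = Salary)) +
  geom_point(color = "red") +
  stat_smooth(method = "lm", se = FALSE, formula = y ~ poly(x, 3)) +
  ggtitle("Truth or Bluff (Linear Regression)") +
  xlab("Level") +
  ylab("Salary") +
  theme(plot.title = element_text(hjust = 0.5))

print(p)
```

With the formula expressed as y ~ poly(x, 3), the smoother can evaluate the fit on its own grid of x values, so the cubic curve should be drawn over the points.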
The reason you are getting the warning
In addition: Warning message:
'newdata' had 80 rows but variables found have 10 rows
is that your data, dataset, has 10 rows. However, stat_smooth computes the fitted y values of the model over 80 x points in order to get a smooth-looking line, so these lengths don't match up.
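You can reproduce the same warning outside ggplot2 with plain lm() and predict(). This is only a sketch of the mechanism, not what stat_smooth literally runs; the names fit_bad, fit_ok and grid are made up for illustration.

```r
dataset <- data.frame(
  Level  = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000, 150000, 200000, 300000, 500000, 1000000)
)

# Formula that hard-codes the vectors, as in the question
fit_bad <- lm(dataset$Salary ~ poly(dataset$Level, 3))

# Formula that refers to columns by name, so newdata can supply them
fit_ok <- lm(Salary ~ poly(Level, 3), data = dataset)

# 80 evenly spaced points, similar in spirit to the grid a smoother predicts on
grid <- data.frame(Level = seq(1, 10, length.out = 80))

pred_ok  <- predict(fit_ok,  newdata = grid)  # one prediction per grid row
pred_bad <- predict(fit_bad, newdata = grid)  # warns: newdata ignored, dataset$Level reused

length(pred_ok)
length(pred_bad)
```

fit_bad cannot find anything called dataset$Level in grid, so it falls back to the original 10-row vector and returns 10 fitted values instead of 80, which is the mismatch the warning describes.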
The reason you don't get the error when you use poly(x, 3) in the formula is that this x resolves to the x of ggplot's constructed data frame, rather than the global x you defined.
Similarly, the reason you do get the error with poly(t, 3) is that t is not in ggplot's constructed data frame, so the next t on the search path is the global t.
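The lookup order can be seen in base R with model.frame(), which resolves formula variables in the supplied data first and only then in the environment where the formula was created. In this sketch layer_df is a hypothetical stand-in for the data frame ggplot2 builds internally (columns named after the aesthetics), and the global x and t are deliberately given different values so you can tell which one was picked up.

```r
# Globals with recognisable values
x <- 101:110
t <- 201:210

# Hypothetical stand-in for ggplot's constructed data frame
layer_df <- data.frame(x = 1:10, y = (1:10)^2)

# x exists in layer_df, so the data frame column wins over the global x
mf_x <- model.frame(y ~ x, data = layer_df)

# t does not exist in layer_df, so the global t is used instead
mf_t <- model.frame(y ~ t, data = layer_df)

mf_x$x
mf_t$t
```

mf_x$x comes back as 1:10 (the data frame column), while mf_t$t comes back as 201:210 (the global). That is why naming the variable x appeared to work: the global was never used at all.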