I have a dataset with very limited data points.
x<- c(4, 8, 13, 24)
y<- c(40, 37, 28, 20)
df<- data.frame(x,y)
Now I want to extrapolate this data, creating a dataset where the value of y will be given for every value (no decimals) of x between 1-100. x and y have a linear relationship.
Secondly, could this be done for multiple dataframes by using something like a loop? Thank you!
This is a short snippet that does this:
linear_xy <- lm(y ~ x, data = df)
# df <- broom:::augment.lm(linear_xy, newdata = complete(df, x = 1:100)) # one way
df <- df %>% # another way
complete(x = 1:100) %>%
mutate(.fitted = predict(linear_xy, newdata = .))
ggplot(df, aes(x, y)) +
geom_line(aes(y = .fitted)) +
geom_point() +
ggpubr::theme_pubr()
This requires that you have the packages {tidyverse}
, {broom}
, and {ggpubr}
installed.
Assumming we want to do this with multiple data-frames, we have to restructure things a bit.
x <- c(4, 8, 13, 24)
y <- c(40, 37, 28, 20)
df <- tibble(x, y)
I don't have multiple data-frames (or tibbles), so I'll make this the
primary one, and make up a function (a factory) that yields data-frames, that are a bit different from the above df
.
df_factory <- . %>%
mutate(x_new = x + sample.int(100, size = n()),
x = if_else(x_new >= 100, x, x_new),
y_new = y + rnorm(n(), mean = median(y), sd = sd(y)),
y = y_new,
y_new = NULL,
x_new = NULL)
Thus df_factory
is a function of one-variable, and that must be a
data-frame that has an x
and y
;
df1 <- df_factory(df)
df2 <- df_factory(df)
df3 <- df_factory(df)
all_dfs <- list(df1, df2, df3)
all_dfs <- bind_rows(all_dfs, .id = "df_id")
Here we ensure that the relation to the original data-frame is preserved in the all_dfs
data-frame via the new variable df_id
.
Next we want to:
data
.rowwise
) we have to perform:
linear_xy
(which are also stored in a list-column`).ggplot
as one contiguous data-frame.all_dfs %>%
nest(data = c(x,y)) %>%
rowwise() %>%
mutate(linear_xy = list(lm(y ~ x, data = data)),
augment = list(broom:::augment.lm(linear_xy,
newdata = complete(data, x = 1:100)))) %>%
ungroup() %>%
select(-data, -linear_xy) %>%
unnest(augment) ->
all_dfs_predictions
Note: ->
at the end shows what the pipe result is now assigned to.
The group
informs ggplot
to treat the rows as separate via their
df_id
. And for fun we add the color and fill to also depend on df_id
. In fact I could have choosen something else to be the color
aesthetics dependent, like "original df
" vs. "others" or if a certain threshold should distinguish them, etc.. But then the group
aesthetic would still tell ggplot
to separate the rows amongst this relation.
ggplot(all_dfs_predictions, aes(x, y, group = df_id, color = df_id, fill = df_id)) +
geom_line(aes(y = .fitted)) +
geom_point() +
lims(x = c(1,100)) +
ggpubr::theme_pubr()