For example, I have data like below.
genotype=rep(c("A","B","C","D","E"), each=5)
env=rep(c(10,20,30,40,50), time=5)
outcome=c(10,15,17,19,22,12,13,15,18,25,10,11,12,13,18,11,15,20,22,28,10,9,10,12,15)
dataA=data.frame(genotype,env,outcome)
Then, I would like to fit between outcome and env per genotype, and also want to calculate RMSE. So, I used this code.
A=anova(lm(outcome~env,data=subset(dataA, genotype=="A")))
##
Response: outcome
Df Sum Sq Mean Sq F value Pr(>F)
env 1 78.4 78.400 84 0.002746 **
Residuals 3 2.8 0.933
##
A_rmse=sqrt(0.933)
A_rmse= 0.9659193
I need to calculate B and until E genotype in the same way, but in my actual data, genotype is more than 100, so it would be impossible to calculate one by one. So I'd like to know how to automatically calculate RMSE (=square root of MSE in anova table) per each genotype.
Could you let me know how to do it?
Always many thanks,
One approach is to use nest
the data by genotype. This allows you to create a data frame that contains the results of anova
in a list column. You can then use broom::tidy
to extract the Mean sq
values and calculate the RMSE.
The basis for this is the excellent tutorial Running a model on separate groups.
First, install the packages if required, then load:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
Here's what nesting the data by genotype looks like:
dataA %>%
nest(data = -genotype)
# A tibble: 5 × 2
genotype data
<chr> <list>
1 A <tibble [5 × 2]>
2 B <tibble [5 × 2]>
3 C <tibble [5 × 2]>
4 D <tibble [5 × 2]>
5 E <tibble [5 × 2]>
Now we can use mutate
and map
to add columns with the anova
results, and the metrics extracted by tidy
:
dataA %>%
nest(data = -genotype) %>%
mutate(A = map(data, ~anova(lm(outcome~env, data = .))),
results = map(A, tidy))
# A tibble: 5 × 4
genotype data A results
<chr> <list> <list> <list>
1 A <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
2 B <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
3 C <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
4 D <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
5 E <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
Now we unnest
the results
column and filter
for the residuals term:
dataA %>%
nest(data = -genotype) %>%
mutate(A = map(data, ~anova(lm(outcome~env, data = .))),
results = map(A, tidy)) %>%
unnest(results) %>%
filter(term == "Residuals")
# A tibble: 5 × 9
genotype data A term df sumsq meansq statistic p.value
<chr> <list> <list> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 A <tibble [5 × 2]> <anova [2 × 5]> Residuals 3 2.80 0.933 NA NA
2 B <tibble [5 × 2]> <anova [2 × 5]> Residuals 3 13.1 4.37 NA NA
3 C <tibble [5 × 2]> <anova [2 × 5]> Residuals 3 6.4 2.13 NA NA
4 D <tibble [5 × 2]> <anova [2 × 5]> Residuals 3 2.7 0.9 NA NA
5 E <tibble [5 × 2]> <anova [2 × 5]> Residuals 3 5.9 1.97 NA NA
And finally, calculate RMSE using sqrt(meansq)
, then select the desired columns.
So the entire process looks like this:
dataA %>%
nest(data = -genotype) %>%
mutate(A = map(data, ~anova(lm(outcome~env, data = .))),
results = map(A, tidy)) %>%
unnest(results) %>%
filter(term == "Residuals") %>%
mutate(rmse = sqrt(meansq)) %>%
select(genotype, rmse)
# A tibble: 5 × 2
genotype rmse
<chr> <dbl>
1 A 0.966
2 B 2.09
3 C 1.46
4 D 0.949
5 E 1.40