How to automatically calculate RMSE per group for fitting data in R?

For example, I have data like below.

genotype=rep(c("A","B","C","D","E"), each=5)
env=rep(c(10,20,30,40,50), time=5)
outcome=c(10,15,17,19,22,12,13,15,18,25,10,11,12,13,18,11,15,20,22,28,10,9,10,12,15)
dataA=data.frame(genotype,env,outcome)

Then, I would like to fit between outcome and env per genotype, and also want to calculate RMSE. So, I used this code.

A=anova(lm(outcome~env,data=subset(dataA, genotype=="A")))

##
Response: outcome
          Df Sum Sq Mean Sq F value   Pr(>F)   
env        1   78.4  78.400      84 0.002746 **
Residuals  3    2.8   0.933   
##

A_rmse=sqrt(0.933)
A_rmse= 0.9659193

I need to calculate B and until E genotype in the same way, but in my actual data, genotype is more than 100, so it would be impossible to calculate one by one. So I'd like to know how to automatically calculate RMSE (=square root of MSE in anova table) per each genotype.

Could you let me know how to do it?

Always many thanks,

Solution

One approach is to use nest the data by genotype. This allows you to create a data frame that contains the results of anova in a list column. You can then use broom::tidy to extract the Mean sq values and calculate the RMSE.

The basis for this is the excellent tutorial Running a model on separate groups.

First, install the packages if required, then load:

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

Here's what nesting the data by genotype looks like:

dataA %>% 
  nest(data = -genotype)

# A tibble: 5 × 2
  genotype data            
  <chr>    <list>          
1 A        <tibble [5 × 2]>
2 B        <tibble [5 × 2]>
3 C        <tibble [5 × 2]>
4 D        <tibble [5 × 2]>
5 E        <tibble [5 × 2]>

Now we can use mutate and map to add columns with the anova results, and the metrics extracted by tidy:

dataA %>% 
  nest(data = -genotype) %>% 
  mutate(A = map(data, ~anova(lm(outcome~env, data = .))), 
         results = map(A, tidy))

# A tibble: 5 × 4
  genotype data             A               results         
  <chr>    <list>           <list>          <list>          
1 A        <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
2 B        <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
3 C        <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
4 D        <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>
5 E        <tibble [5 × 2]> <anova [2 × 5]> <tibble [2 × 6]>

Now we unnest the results column and filter for the residuals term:

dataA %>% 
  nest(data = -genotype) %>% 
  mutate(A = map(data, ~anova(lm(outcome~env, data = .))), 
         results = map(A, tidy)) %>% 
  unnest(results) %>% 
  filter(term == "Residuals")

# A tibble: 5 × 9
  genotype data             A               term         df sumsq meansq statistic p.value
  <chr>    <list>           <list>          <chr>     <int> <dbl>  <dbl>     <dbl>   <dbl>
1 A        <tibble [5 × 2]> <anova [2 × 5]> Residuals     3  2.80  0.933        NA      NA
2 B        <tibble [5 × 2]> <anova [2 × 5]> Residuals     3 13.1   4.37         NA      NA
3 C        <tibble [5 × 2]> <anova [2 × 5]> Residuals     3  6.4   2.13         NA      NA
4 D        <tibble [5 × 2]> <anova [2 × 5]> Residuals     3  2.7   0.9          NA      NA
5 E        <tibble [5 × 2]> <anova [2 × 5]> Residuals     3  5.9   1.97         NA      NA

And finally, calculate RMSE using sqrt(meansq), then select the desired columns.

So the entire process looks like this:

dataA %>% 
  nest(data = -genotype) %>% 
  mutate(A = map(data, ~anova(lm(outcome~env, data = .))), 
         results = map(A, tidy)) %>% 
  unnest(results) %>% 
  filter(term == "Residuals") %>% 
  mutate(rmse = sqrt(meansq)) %>% 
  select(genotype, rmse)

# A tibble: 5 × 2
  genotype  rmse
  <chr>    <dbl>
1 A        0.966
2 B        2.09 
3 C        1.46 
4 D        0.949
5 E        1.40