r yardstick tidymodels

yardstick::rmse on grouped data returns error and incorrect results


I wanted to evaluate the performance of several regression models and used the yardstick package to calculate the RMSE. Here is some example data:

  model obs pred
1     A   1    1
2     B   1    2
3     C   1    3

When I run the following code

library(yardstick)
library(dplyr)
dat %>%
 group_by(model) %>%
 summarise(RMSE = yardstick::rmse(truth = obs, estimate = pred))

I get the following error

Error in summarise_impl(.data, dots) : no applicable method for 'rmse' applied to an object of class "c('double', 'numeric')".

However, when I explicitly supply . as the first argument (which I thought should not be necessary), I get no error, but the results are incorrect.

dat %>%
 group_by(model) %>%
 summarise(RMSE = yardstick::rmse(., truth = obs, estimate = pred))
# A tibble: 3 x 2
  model   RMSE
  <fctr> <dbl>
1 A       1.29
2 B       1.29
3 C       1.29
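
Notably, 1.29 is exactly the RMSE computed over all three rows at once, which makes me suspect that the grouping is being ignored and the whole data frame is used for every group:

sqrt(mean((dat$obs - dat$pred)^2))
# [1] 1.290994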

I was expecting the following

# A tibble: 3 x 2
  model   RMSE
  <fctr> <dbl>
1 A       0
2 B       1.00
3 C       2.00

I know that there are alternatives to this function, but I still don't understand this behavior.

data

dat <- structure(list(model = structure(1:3, .Label = c("A", "B", "C"), class = "factor"), obs = c(1, 1, 1), pred = 1:3), .Names = c("model", "obs", "pred"), row.names = c(NA, -3L), class = "data.frame")

Solution

  • rmse() expects a data frame as its first argument, which is why calling it on bare columns inside summarise() fails. And when we supply ., the dot refers to the entire data frame rather than the current group, so the RMSE is computed over all rows for every group. To get a per-group RMSE, we can use the do function to apply the rmse function to every group.

    dat %>%
      group_by(model) %>%
      do(data_frame(model = .$model[1], obs = .$obs[1], pred = .$pred[1], 
         RMSE = yardstick::rmse(., truth = obs, estimate = pred)))
    # # A tibble: 3 x 4
    # # Groups: model [3]
    # model    obs  pred  RMSE
    #  <fctr> <dbl> <int> <dbl>
    # 1 A       1.00     1  0   
    # 2 B       1.00     2  1.00
    # 3 C       1.00     3  2.00
    

    Or we can split the data frame by model and apply the rmse function to each piece. This works here because each model has exactly one row and split(.$model) returns the groups in the same order as the rows, so the results line up in mutate.

    dat %>%
      mutate(RMSE = dat %>%
               split(.$model) %>%
               sapply(yardstick::rmse, truth = obs, estimate = pred))
    #   model obs pred RMSE
    # 1     A   1    1    0
    # 2     B   1    2    1
    # 3     C   1    3    2
    

    Or we can nest the obs and pred columns into a list column and then apply the rmse function to each nested data frame.

    library(tidyr)
    
    dat %>%
      nest(obs, pred) %>%
      mutate(RMSE = sapply(data, yardstick::rmse, truth = obs, estimate = pred)) %>%
      unnest()
    #   model RMSE obs pred
    # 1     A    0   1    1
    # 2     B    1   1    2
    # 3     C    2   1    3
    

    The outputs of these three methods differ slightly in shape, but all contain the correct RMSE values. Here I use the microbenchmark package to compare their performance.

    library(microbenchmark)
    
    microbenchmark(m1 = {dat %>%
        group_by(model) %>%
        do(data_frame(model = .$model[1], obs = .$obs[1], pred = .$pred[1], 
                      RMSE = yardstick::rmse(., truth = obs, estimate = pred)))},
        m2 = {dat %>%
            mutate(RMSE = dat %>%
                     split(.$model) %>%
                     sapply(yardstick::rmse, truth = obs, estimate = pred))},
        m3 = {dat %>%
            nest(obs, pred) %>%
            mutate(RMSE = sapply(data, yardstick::rmse, truth = obs, estimate = pred)) %>%
            unnest()})
    
    # Unit: milliseconds
    # expr      min       lq     mean   median       uq       max neval
    #   m1 43.18746 46.71055 50.23383 48.46554 51.05639 174.46371   100
    #   m2 14.08516 14.78093 16.14605 15.74505 16.89936  24.02136   100
    #   m3 28.99795 30.90407 32.71092 31.89954 33.94729  44.57953   100
    

    The result shows that m2 is the fastest, while m1 is the slowest. The implication is that the do operation is usually slower than the other approaches, so we should avoid do when possible. Although m2 is the fastest, I personally like the syntax of m3 the best: the nested data frame makes it easy to summarize information across different models or groups.
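
    For example, the nested data frame makes it straightforward to compute several metrics per model in one pass. Here is a minimal sketch that also computes the MAE, assuming yardstick::mae takes a data frame plus truth/estimate columns just like rmse does:

    # assumes yardstick::mae shares rmse's data-frame interface
    dat %>%
      nest(obs, pred) %>%
      mutate(RMSE = sapply(data, yardstick::rmse, truth = obs, estimate = pred),
             MAE  = sapply(data, yardstick::mae,  truth = obs, estimate = pred)) %>%
      unnest()
    # every group has a single row here, so the per-model RMSE and MAE coincide: 0, 1, 2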