rpmap

pmap_dbl and mean, Strange results


When I compute row-wise values with pmap_dbl(), the sum function gives the correct result for each row, but the mean function just returns the value from the first column. I can't work out why.

library(tidyverse)

data <- tibble(
  x = as.numeric(1:9 *50),
  y = as.numeric(10:18),
  z = as.numeric(-10:-2)
)

data %>%
  mutate(sum = pmap_dbl(.,sum),
         mean = pmap_dbl(.,mean))
# A tibble: 9 × 5
      x     y     z   sum  mean
  <dbl> <dbl> <dbl> <dbl> <dbl>
1    50    10   -10    50    50
2   100    11    -9   102   100
3   150    12    -8   154   150
4   200    13    -7   206   200
5   250    14    -6   258   250
6   300    15    -5   310   300
7   350    16    -4   362   350
8   400    17    -3   414   400
9   450    18    -2   466   450

Can anyone help explain this situation?


Solution

  • The function signature of base::sum() is sum(..., na.rm = FALSE), where ... are the values to sum. It will take the sum of all arguments provided, as long as they're not named na.rm, e.g.:

    sum(1, 2) # 3
    sum(2, countme = 3) # 5
    

    On the other hand, the signature of base::mean.default() is mean.default(x, trim = 0, na.rm = FALSE, ...), where the values to average are supplied together as a single vector, x, e.g.:

    mean(1, 2, 3) # interpreted as mean(x = 1, trim = 2, na.rm = 3)
    # [1] 1
    mean(c(1, 2, 3)) # mean(x = c(1,2,3), trim = 0, na.rm = FALSE)
    # [1] 2
    

    purrr::pmap_dbl() passes the values in each row to the function as separate arguments, not as a single vector. If we take, for example, your second row, this is effectively what's happening:

    sum(100, 11, -9)
    # [1] 102
    
    mean(100, 11, -9)
    # [1] 100
    

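    This is because a positional call like mean(100, 11, -9) is interpreted as mean(x = 100, trim = 11, na.rm = -9). (Strictly, because your tibble's columns are named, pmap_dbl() supplies them by name, so the call is mean(x = 100, y = 11, z = -9); y and z match neither trim nor na.rm, so they fall into ... and are ignored. The result is the same.) The trim parameter is supposed to be: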

    the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.

    Interestingly, the R source for mean.default() contains the line:

    if(trim >= 0.5) return(stats::median(x, na.rm=FALSE))
    

    So a positional call such as mean(100, 11, -9) actually returns the median. You can see that in these examples:

    mean(c(1, 1, 500)) # as expected
    # [1] 167.3333
    mean(c(1, 1, 500), trim = 0.5) # returns the median
    # [1] 1
    

    Of course, in your case mean() receives a vector of length one as x, so the mean and the median are equal.
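
    The argument matching described above can be checked directly. The following is a minimal sketch in base R; show_match() is a hypothetical helper (not part of base R or purrr) that copies the formals of mean.default() and reports what each argument received. It assumes the behaviour documented for purrr's pmap family, namely that arguments are supplied by name when the input list is named:

    ```r
    # Hypothetical stand-in with the same formals as mean.default();
    # it returns the matched arguments instead of computing anything.
    show_match <- function(x, trim = 0, na.rm = FALSE, ...) {
      list(x = x, trim = trim, na.rm = na.rm, dots = list(...))
    }

    # Positional call: the 2nd and 3rd values land in trim and na.rm.
    positional <- show_match(100, 11, -9)

    # Named call (the form used when the columns are named x, y, z):
    # y and z match no formal, so they are collected by ... instead.
    named <- show_match(x = 100, y = 11, z = -9)

    # In both forms only the first value is averaged.
    mean_positional <- mean(100, 11, -9)
    mean_named <- mean(x = 100, y = 11, z = -9)
    ```

    Printing positional and named shows where each value ended up; in neither form do 11 and -9 contribute to the average.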

    What you want to do in the second row is:

    mean(c(100, 11, -9))
    # [1] 34
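
    To get that behaviour while still using pmap_dbl(), one option is to wrap mean() so the separate arguments are combined into a vector first. This is only a sketch: row_mean() is a hypothetical helper name, and the columns are passed explicitly as list(x, y, z) rather than via the . placeholder.

    ```r
    library(dplyr)
    library(purrr)
    library(tibble)

    # Same data as in the question.
    data <- tibble(
      x = as.numeric(1:9 * 50),
      y = as.numeric(10:18),
      z = as.numeric(-10:-2)
    )

    # Hypothetical helper: gather all arguments into one vector,
    # then hand that vector to mean() as its x argument.
    row_mean <- function(...) mean(c(...))

    result <- data %>%
      mutate(
        sum  = pmap_dbl(list(x, y, z), sum),
        mean = pmap_dbl(list(x, y, z), row_mean)
      )
    ```

    Here result$mean[2] is mean(c(100, 11, -9)), i.e. 34, while the sum column is unchanged from the original attempt.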
    

    Given that you are already calculating the sum, you could reuse it to calculate the mean by dividing by the number of columns, e.g.:

    data %>%
        mutate(
            sum = rowSums(.),
            mean = sum / ncol(.)
        )
    # # A tibble: 9 × 5
    #       x     y     z   sum  mean
    #   <dbl> <dbl> <dbl> <dbl> <dbl>
    # 1    50    10   -10    50  16.7
    # 2   100    11    -9   102  34  
    # 3   150    12    -8   154  51.3
    # 4   200    13    -7   206  68.7
    # 5   250    14    -6   258  86  
    # 6   300    15    -5   310 103. 
    # 7   350    16    -4   362 121. 
    # 8   400    17    -3   414 138  
    # 9   450    18    -2   466 155.
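
    As a possible alternative to dividing by the column count, base R's rowMeans() computes the per-row mean directly. A minimal sketch, reusing the data from the question and binding the three columns into a matrix with cbind() (a choice made here for illustration, not something from the original answer):

    ```r
    library(dplyr)
    library(tibble)

    # Same data as in the question.
    data <- tibble(
      x = as.numeric(1:9 * 50),
      y = as.numeric(10:18),
      z = as.numeric(-10:-2)
    )

    # cbind() builds a 9 x 3 numeric matrix from the columns;
    # rowSums() and rowMeans() then work across each row.
    result <- data %>%
      mutate(
        sum  = rowSums(cbind(x, y, z)),
        mean = rowMeans(cbind(x, y, z))
      )
    ```

    Both functions are vectorised over rows, so this avoids calling a function once per row as pmap_dbl() does.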