pmap_dbl and mean, Strange results

When I run the calculation row by row, sum() gives the correct result, but mean() just returns the value of the first column. I can't work out why.

library(tidyverse)

data <- tibble(
  x = as.numeric(1:9 *50),
  y = as.numeric(10:18),
  z = as.numeric(-10:-2)
)

data %>%
  mutate(sum = pmap_dbl(.,sum),
         mean = pmap_dbl(.,mean))

# A tibble: 9 × 5
      x     y     z   sum  mean
  <dbl> <dbl> <dbl> <dbl> <dbl>
1    50    10   -10    50    50
2   100    11    -9   102   100
3   150    12    -8   154   150
4   200    13    -7   206   200
5   250    14    -6   258   250
6   300    15    -5   310   300
7   350    16    -4   362   350
8   400    17    -3   414   400
9   450    18    -2   466   450

Can anyone help explain this situation?

Solution

The function signature of base::sum() is sum(..., na.rm = FALSE), where ... are the values to sum. It will take the sum of all arguments provided, as long as they're not named na.rm, e.g.:

sum(1, 2) # 3
sum(2, countme = 3) # 5

On the other hand, the signature of base::mean.default() is mean.default(x, trim = 0, na.rm = FALSE, ...), where the values you want to average are supplied together in x, as a single vector, e.g.:

mean(1, 2, 3) # interpreted as mean(x = 1, trim = 2, na.rm = 3)
# [1] 1
mean(c(1, 2, 3)) # mean(x = c(1,2,3), trim = 0, na.rm = FALSE)
# [1] 2
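One way to check the argument matching described in the comments above is match.call(), which reports how R lines up positional arguments with a function's formal parameters. This is a minimal sketch using only base R; the variable name matched is just for illustration.

```r
# Ask R how the positional arguments of mean(1, 2, 3) line up with
# the formal parameters of mean.default().
matched <- match.call(mean.default, quote(mean(1, 2, 3)))
print(matched)
# mean(x = 1, trim = 2, na.rm = 3)
```

Only the first value ends up in x; the other two are consumed by trim and na.rm.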

purrr::pmap_dbl() passes each row's values to the function as separate arguments, not as a single vector. Taking your second row as an example, this is what happens:

sum(100, 11, -9)
# [1] 102

mean(100, 11, -9)
# [1] 100

This is because your function call is interpreted as mean(100, trim = 11, na.rm = -9). The trim parameter is supposed to be:

the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.
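For comparison, here is a small sketch of trim used as documented, with a value inside the 0 to 0.5 range. The vector values and the names plain and trimmed are made up for illustration.

```r
values <- c(1, 2, 3, 4, 100)

# Ordinary mean: the outlier 100 pulls the result up.
plain <- mean(values)
plain
# [1] 22

# trim = 0.2 drops 20% of the observations (here, one value) from each
# end of the sorted vector, so only 2, 3 and 4 are averaged.
trimmed <- mean(values, trim = 0.2)
trimmed
# [1] 3
```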

Interestingly, the R source for mean.default() contains the line:

if(trim >= 0.5) return(stats::median(x, na.rm=FALSE))

Since trim = 11 is at least 0.5, you are actually getting the median back. You can see the same behavior in these examples:

mean(c(1, 1, 500)) # as expected
# [1] 167.3333
mean(c(1, 1, 500), trim = 0.5) # returns the median
# [1] 1

Of course, in your case you are passing mean() a vector of length one, so the mean and the median are equal.

What you want to do in the second row is:

mean(c(100, 11, -9))
# [1] 34
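If you would rather keep pmap_dbl(), one possible fix is to wrap mean() in a function that first collects the row's values into a single vector. This is a sketch rather than the only way to do it; it rebuilds your data as a plain data frame so that it runs on its own, and row_means is a name chosen for illustration.

```r
library(purrr)

data <- data.frame(
  x = as.numeric(1:9 * 50),
  y = as.numeric(10:18),
  z = as.numeric(-10:-2)
)

# function(...) receives one row's values as separate arguments;
# c(...) gathers them into a single vector, which mean() treats as x.
row_means <- pmap_dbl(data, function(...) mean(c(...)))
row_means[2]
# [1] 34
```

Inside your original pipeline, the same idea would read mean = pmap_dbl(., function(...) mean(c(...))).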

Given that you are already calculating the sum, you could use that to calculate the mean, e.g.

data %>%
    mutate(
        sum = rowSums(.),
        mean = sum / ncol(.)
    )
# # A tibble: 9 × 5
#       x     y     z   sum  mean
#   <dbl> <dbl> <dbl> <dbl> <dbl>
# 1    50    10   -10    50  16.7
# 2   100    11    -9   102  34  
# 3   150    12    -8   154  51.3
# 4   200    13    -7   206  68.7
# 5   250    14    -6   258  86  
# 6   300    15    -5   310 103. 
# 7   350    16    -4   362 121. 
# 8   400    17    -3   414 138  
# 9   450    18    -2   466 155.
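Base R also has rowMeans(), the counterpart of rowSums(), which avoids dividing by the column count by hand. A minimal sketch on the same data, again rebuilt as a plain data frame so that it is self-contained; the column selection cols is an assumption so that the new sum column is not included in the mean.

```r
data <- data.frame(
  x = as.numeric(1:9 * 50),
  y = as.numeric(10:18),
  z = as.numeric(-10:-2)
)

# Restrict both calculations to the original value columns.
cols <- c("x", "y", "z")
data$sum  <- rowSums(data[cols])
data$mean <- rowMeans(data[cols])

# Second row: sum is 102 and mean is 34, matching the table above.
data[2, ]
```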