When I test row-wise calculations, the sum function produces the correct results, but the mean function only returns the value from the first column. I can't quite understand the reason for this.
library(tidyverse)
data <- tibble(
  x = as.numeric(1:9 * 50),
  y = as.numeric(10:18),
  z = as.numeric(-10:-2)
)
data %>%
  mutate(sum = pmap_dbl(., sum),
         mean = pmap_dbl(., mean))
# A tibble: 9 × 5
x y z sum mean
<dbl> <dbl> <dbl> <dbl> <dbl>
1 50 10 -10 50 50
2 100 11 -9 102 100
3 150 12 -8 154 150
4 200 13 -7 206 200
5 250 14 -6 258 250
6 300 15 -5 310 300
7 350 16 -4 362 350
8 400 17 -3 414 400
9 450 18 -2 466 450
Can anyone help explain this situation?
The function signature of base::sum() is sum(..., na.rm = FALSE), where ... are the values to sum. It will take the sum of all arguments provided, as long as they're not named na.rm, e.g.:
sum(1, 2) # 3
sum(2, countme = 3) # 5
On the other hand, the signature for base::mean.default() is mean(x, trim = 0, na.rm = FALSE, ...), where the values you want to average are supplied in x, a vector, e.g.:
mean(1, 2, 3) # interpreted as mean(x = 1, trim = 2, na.rm = 3)
# [1] 1
mean(c(1, 2, 3)) # mean(x = c(1,2,3), trim = 0, na.rm = FALSE)
# [1] 2
purrr::pmap_dbl() is passing the arguments separately. If we take, for example, your second row, this is what's happening:
sum(100, 11, -9)
# [1] 102
mean(100, 11, -9)
# [1] 100
This is because your function call is interpreted as mean(100, trim = 11, na.rm = -9). The trim parameter is supposed to be:
the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.
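To make that definition concrete, here is a minimal sketch of trim used as intended, with a fraction below 0.5. The vector x and the names plain and trimmed are made up for illustration and are not taken from your data:

```r
# Illustrative values only: one large outlier at the end
x <- c(1, 2, 3, 4, 100)

# Ordinary mean over all five values: (1 + 2 + 3 + 4 + 100) / 5
plain <- mean(x)
plain
# [1] 22

# trim = 0.2 drops floor(5 * 0.2) = 1 observation from each end,
# so only 2, 3 and 4 are averaged
trimmed <- mean(x, trim = 0.2)
trimmed
# [1] 3
```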
Interestingly, the R source for mean.default() contains the line:
if(trim >= 0.5) return(stats::median(x, na.rm=FALSE))
So actually you are returning the median. You can see that in these examples:
mean(c(1, 1, 500)) # as expected
# [1] 167.3333
mean(c(1, 1, 500), trim = 0.5) # returns the median
# [1] 1
Of course, in your case you are passing mean() a vector of length one, so the mean and the median are equal.
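One nuance that may be worth checking on your own machine: as far as I understand, pmap() passes a data frame's columns to the function by name, so with columns called x, y and z the call looks more like mean(x = 100, y = 11, z = -9). In that form y and z would be absorbed by ... rather than matched to trim and na.rm. Either way only x is averaged, so the result is the same. A small sketch, where row2, arg_names, by_name and by_position are names I made up for illustration:

```r
library(purrr)
library(tibble)

# The second row of the example data, rebuilt by hand
row2 <- tibble(x = 100, y = 11, z = -9)

# Hypothetical probe: report the argument names the function receives from pmap()
arg_names <- pmap(row2, function(...) names(list(...)))[[1]]
arg_names
# [1] "x" "y" "z"

# Named call: y and z do not match any formal of mean.default(),
# so they fall into `...` and are ignored
by_name <- mean(x = 100, y = 11, z = -9)

# Positional call: the second and third values land in trim and na.rm
by_position <- mean(100, 11, -9)

c(by_name, by_position)
# [1] 100 100
```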
What you want to do in the second row is:
mean(c(100, 11, -9))
# [1] 34
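If you would rather keep the pmap_dbl() approach, one way to get that behaviour is to wrap mean() in a small function that collects the row's values into a single vector first. This is a sketch rather than the only way to do it; result is just a name I picked:

```r
library(dplyr)
library(purrr)

data <- tibble(
  x = as.numeric(1:9 * 50),
  y = as.numeric(10:18),
  z = as.numeric(-10:-2)
)

result <- data %>%
  mutate(
    sum = pmap_dbl(., sum),
    # c(...) gathers the row's values into one vector, which becomes mean()'s x
    mean = pmap_dbl(., function(...) mean(c(...)))
  )

result$mean[2]
# [1] 34
```

Here the . still refers to the original three-column data, so the new sum column is not included in the mean.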
Given that you are already calculating the sum, you could use that to calculate the mean, e.g.
data %>%
  mutate(
    sum = rowSums(.),
    mean = sum / ncol(.)
  )
# # A tibble: 9 × 5
# x y z sum mean
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 50 10 -10 50 16.7
# 2 100 11 -9 102 34
# 3 150 12 -8 154 51.3
# 4 200 13 -7 206 68.7
# 5 250 14 -6 258 86
# 6 300 15 -5 310 103.
# 7 350 16 -4 362 121.
# 8 400 17 -3 414 138
# 9 450 18 -2 466 155.
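Base R also has rowMeans(), the counterpart of rowSums(), so you could skip the division entirely. A minimal sketch along the same lines as the code above, with result as an illustrative name:

```r
library(dplyr)

data <- tibble(
  x = as.numeric(1:9 * 50),
  y = as.numeric(10:18),
  z = as.numeric(-10:-2)
)

result <- data %>%
  mutate(
    sum = rowSums(.),
    # `.` is the original data passed into the pipe, so only x, y and z are averaged
    mean = rowMeans(.)
  )

result$mean[2]
# [1] 34
```

Both functions work on the whole numeric data frame at once, which tends to be faster than calling a function once per row.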