Coding Objective:
Make a tidy compatible function that I can pipe a df into to calculate pooled standard deviation. It needs to respect group_by() groupings as well for the summarise() output.
Underlying Objective:
Report pooled standard deviation statistics for each treatment x timepoint x condition sub-grouping when there is unequal sampling. It is assumed that all the subjects come from the same population.
I think I want to use this pooled standard deviation formula, but am uncertain:
The Data:
For each sub-condition, I calculated the arithmetic lognormal variance for each subject as well as the number of observations.
df <- structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8), intervention = c("trt",
"trt", "trt", "trt", "ctrl", "ctrl", "ctrl", "ctrl", "trt", "trt",
"trt", "trt", "trt", "trt", "trt", "trt", "trt", "trt", "trt",
"trt", "ctrl", "ctrl", "ctrl", "ctrl", "ctrl", "ctrl", "ctrl",
"ctrl", "ctrl", "ctrl", "ctrl", "ctrl"), timepoint = c("t1",
"t1", "t2", "t2", "t1", "t1", "t2", "t2", "t1", "t1", "t2", "t2",
"t1", "t1", "t2", "t2", "t1", "t1", "t2", "t2", "t1", "t1", "t2",
"t2", "t1", "t1", "t2", "t2", "t1", "t1", "t2", "t2"), lognormal_var = c(0.16,
0.218, 0.15, 0.368, 0.262, 0.283, 0.21, 0.218, 0.167, 0.211,
0.205, 0.278, 0.261, 0.301, 0.173, 0.313, 0.187, 0.27, 0.168,
0.262, 0.162, 0.272, 0.202, 0.195, 0.176, 0.181, 0.152, 0.257,
0.144, 0.163, 0.284, 0.269), n = c(1223, 1011, 830, 597, 1048,
1332, 672, 380, 623, 400, 1006, 820, 967, 752, 799, 884, 945,
941, 675, 395, 815, 787, 823, 753, 861, 727, 1508, 1255, 762,
791, 1382, 939), condition = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2,
1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1,
2)), row.names = c(NA, -32L), class = c("tbl_df", "tbl", "data.frame"
))
I have 8 subjects, 4 in each treatment condition. Each subject is repeatedly sampled at 2 timepoints, and within each timepoint they are assessed in 2 different conditions. Within each timepoint x condition, subjects are variably sampled between a couple hundred to a couple thousand times.
What I tried:
Latest Attempt Edit: 3rd attempt
This doesn't respect the groupings from the group_by(). The column pooled_lognormal_sd is correct per the solution provided by u/Ricardo Semião e Castro .
pooled_var <- function(.data){
var <- .data$lognormal_var
sqrt(sum(var * ( .data$n - 1 )) / (sum(.data$n) - nrow(.data)))
}
df |>
group_by(intervention, timepoint, condition) %>%
summarise(pooled_var(.),
pooled_lognormal_sd = sqrt(sum(lognormal_var * ( n - 1 )) / (sum(n) - n())))
Log-normal variance function:
lognormal_var <- function(x) {
exp(2*mean(log(x)) + var(log(x)))*(exp(var(log(x))) - 1)
}
1st attempt:
pooled_var <- function(df,
var = "lognormal_var"){
var <- if(var %in% names(df)) df$var else print0("Variable name not found in df")
sum(var * ( df$n - 1 )) / (sum(df$n) - nrow(df))
}
library(dplyr)
df |>
group_by(intervention, timepoint, condition) %>%
pooled_var(var = "lognormal_var") |>
sqrt()
Which yields the error: "Warning: Unknown or uninitialised column: var
.2 0". I expect 8 rows to be returned for each combination of timepoint, treatment, and condition.
You created the groups but then didn't used a dplyr verb to apply your function to every group, that's why you only got one value. You want to use summarise
.
Also, as you're using a tidyverse approach, you don't need to make a function that takes the column as a string. dplyr is built around calling column directly.
pooled_var <- function(var, n, k) sqrt(sum(var * ( n - 1 )) / (sum(n) - k))
df %>%
group_by(intervention, timepoint, condition) %>%
summarise(sigma = pooled_var(lognormal_var, n, n()))
Now, lognormal_var
isn't the whole df
column, but respects the groups. The same is true for n
and the number of lines in each group (which I imagine is what "k" should be), n()
.
Result:
# A tibble: 8 × 4
# Groups: intervention, timepoint [4]
intervention timepoint condition sigma
<chr> <chr> <dbl> <dbl>
1 ctrl t1 1 0.438
2 ctrl t1 2 0.484
3 ctrl t2 1 0.460
4 ctrl t2 2 0.492
5 trt t1 1 0.440
6 trt t1 2 0.503
7 trt t2 1 0.419
8 trt t2 2 0.554
Obs: I'm not 100% sure if this is respecting your formula, you should check if the result is reasonable.