I'm trying to use the tbl_summary()
function from the gtsummary
package to summarize survey data.
However, I'm finding it challenging to deal with numeric columns that contain placeholder values.
For example, I might be working with a numeric variable called weight
which includes the values 998 and 999 to represent "Don't know/No answer" and "Refused", respectively, as specified in the data dictionary.
If I use tbl_summary()
as is, these placeholders are included in the calculation of the descriptive statistics, which is misleading:
library(tidyverse)
library(gtsummary)
df <- tibble(
weight = c(
72, 83, 65, 998, 79,
56, 999, 92, 67, 84,
998, 75, 68, 999, NA,
998, 73, 80, 91, NA
)
)
tbl_summary(
data = df,
statistic = all_continuous() ~ "{mean} ({sd})"
)
Ideally, I'd like the mean and SD to be calculated without the placeholder values, and to display the count and proportion of each placeholder value separately.
I've tried using custom functions to filter out the placeholder values for the mean and SD calculations, and then added separate rows manually for each placeholder value with its count and proportion:
library(tidyverse)
library(gtsummary)
df <- tibble(
weight = c(
72, 83, 65, 998, 79,
56, 999, 92, 67, 84,
998, 75, 68, 999, NA,
998, 73, 80, 91, NA
)
)
mean_custom <- function(x) {
x <- na.omit(x)
x <- x[!x %in% c(998, 999)]
mean(x)
}
sd_custom <- function(x) {
x <- na.omit(x)
x <- x[!x %in% c(998, 999)]
sd(x)
}
# calculate count for "Don't know/No answer"
n_998 <- function(x) {
sum(x == 998)
}
# calculate proportion for "Don't know/No answer"
prop_998 <- function(x) {
sum(x == 998) / length(x) * 100
}
# calculate count for "Refused"
n_999 <- function(x) {
sum(x == 999)
}
# calculate proportion for "Refused"
prop_999 <- function(x) {
sum(x == 999) / length(x) * 100
}
tbl_summary(df,
type = list(weight = "continuous2"),
statistic = all_continuous() ~ c(
"{mean_custom} ({sd_custom})",
"{n_998} ({prop_998}%)",
"{n_999} ({prop_999}%)"
)
)
I would still need to provide more descriptive labels for the summary statistics (e.g., relabeling "n_999 (prop_999%)" as Refused).
That being said, is there a more streamlined approach in gtsummary
for handling numeric columns with placeholder values (or at least an easier way to manage the labeling of summary statistics for these special values)?
Thanks!
The R language doesn't support multiple types of missing information like some other languages. I suggest you create two variables: one with the result, and the other about the status/missingness. You can then put both in the table. See an example below for how I would do it.
library(tidyverse)
library(gtsummary)
df <- tibble(
weight = c(
72, 83, 65, 998, 79,
56, 999, 92, 67, 84,
998, 75, 68, 999, NA,
998, 73, 80, 91, NA
)
) |>
mutate(
weight_status =
case_when(
weight == 999 ~ "Don't Know",
weight == 998 ~ "Refused",
is.na(weight) ~ "Missing",
!is.na(weight) ~ "Answered"
) |>
factor(levels = c("Answered", "Don't Know", "Refused", "Missing")),
weight = ifelse(weight %in% c(999, 998) , NA, weight)
)
tbl_summary(
data = df,
type = all_continuous() ~ "continuous2",
statistic = all_continuous() ~ "{mean} ({sd})",
missing = "no"
) |>
# remove header row for status
remove_row_type(variables = "weight_status") |>
# remove "Answered" level
modify_table_body(~filter(.x, label != "Answered")) |>
bold_labels() |>
as_kable() # conert to kable to display table on SO
Characteristic | N = 20 |
---|---|
weight | |
Mean (SD) | 76 (10) |
Don’t Know | 2 (10%) |
Refused | 3 (15%) |
Missing | 2 (10%) |
Created on 2024-11-05 with reprex v2.1.1