
datasummary: Combine factor and numeric variables in a single table


I am trying to create a table with factor and numeric variables using modelsummary. My approach is to convert the factor variables to numeric, so that each one occupies a single row and all variables appear in the same column. I then manually calculate the number of units in each level of each former factor and attach this as text to the corresponding variable. I am trying to do this with the function N_alt in the example below:

library(modelsummary)
library(kableExtra)

tmp <- mtcars[, c("mpg", "hp")]

tmp$class <- 0
tmp$class[15:32] <- 1
tmp$class <- as.factor(tmp$class)

tmp$region <- 1
tmp$region[15:20] <- 2
tmp$region[21:32] <- 3
tmp$region <- as.factor(tmp$region)

tmp$class <- 0
tmp$region <- 0

N_alt = function(x) {
  if (x %in% c(tmp$class)) {
    paste0('[14 (43.8); 18 (56.3)]') 
  } else if (x %in% c(tmp$region)) {
    paste0('[14 (43.8); 6 (18.8); 12 (37.5)]')  
  } else {
    paste0('[32 (100)]')
  }
}


# create a table with `datasummary`
emptycol = function(x) " "
datasummary(mpg + (`class [0,1]`= class) + (`region [A,B,C]`= region) + hp ~ Heading("N (%)") * N_alt, data = tmp)

which gives me: [screenshot of the resulting table]

My N_alt function does not work properly. class is correct, but region is not. I am not getting any warning messages.

I have also tried:

N_alt = function(x) {
  if (x[1] %in% c(tmp$class)) {
    paste0('[14 (43.8); 18 (56.3)]') 
  } else if (x[1] %in% c(tmp$region)) {
    paste0('[14 (43.8); 6 (18.8); 12 (37.5)]')  
  } else {
    paste0('[32 (100)]')
  }
}

but I obtained the same output. I have written similar functions with these vectors and they worked fine, but for some reason this one does not.

Additionally, I also tried:

N_alt <- c('[32 (100)]','[14 (43.8); 18 (56.3)]','[14 (43.8); 6 (18.8); 12 (37.5)]','[32 (100)]')

and

N_alt <- c(rep('[32 (100)]',32),rep('[14 (43.8); 18 (56.3)]',32),rep('[14 (43.8); 6 (18.8); 12 (37.5)]',32),rep('[32 (100)]',32))

but I get:

Error in datasummary(mpg + (`class [0,1]` = class) + (`region [A,B,C]` = region) +  : 
  Argument 'N_alt' is not length 32

Does anyone know what I am missing here?

Edit:

It does seem possible to write a function like Mean_alt below, which performs two different actions: it drops the decimal places for certain numeric variables (simply converting them with as.integer did not work for me), and it shows no Mean for the former factor variables:

library(modelsummary)
library(kableExtra)

tmp <- mtcars[, c("mpg", "hp")]

tmp$class <- 0
tmp$class[15:32] <- 1
tmp$class <- as.factor(tmp$class)

tmp$region <- 1
tmp$region[15:20] <- 2
tmp$region[21:32] <- 3
tmp$region <- as.factor(tmp$region)

tmp$class <- 0
tmp$region <- 0

N_alt = function(x) {
  if (x %in% c(tmp$class)) {
    paste0('[14 (43.8); 18 (56.3)]') 
  } else if (x %in% c(tmp$region)) {
    paste0('[14 (43.8); 6 (18.8); 12 (37.5)]')  
  } else {
    paste0('[32 (100)]')
  }
}

Mean_alt = function(x) {
  if (x %in% c(tmp$mpg)) {
    as.character(floor(mean(x)), length=5)
  } else if (x %in% c(tmp$class, tmp$region)) {
    paste0("")
  } else {
    mean(x)
  }
}

# create a table with `datasummary`
emptycol = function(x) " "
datasummary(mpg + (`class [0,1]`= class) + (`region [A,B,C]`= region) + hp ~ Heading("N (%)") * N_alt + Heading("Mean") * Mean_alt, data = tmp)

output: [screenshot of the resulting table]


Solution

  • You are running up against three limitations.

    The first limitation is in Base R:

    1. As explained in the R manual, the condition in an if/else statement must evaluate to a single TRUE or FALSE. Internally, datasummary applies N_alt to each variable in turn. Each time, N_alt receives a new vector of length 32. Frankly, I don’t think it makes much sense to check the value of the first element of that vector; I don’t see how this gets us where we want to go.
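
To make this first limitation concrete, here is a minimal sketch in base R. The lookup values 21 and 22.8 are arbitrary picks for illustration, and `has_match` and `all_match` are names introduced here. `%in%` is vectorised, so it returns one logical per element rather than the single value that `if` expects; `any()` or `all()` collapses it explicitly.

```r
x <- mtcars$mpg                 # a length-32 vector, like the one N_alt receives
cond <- x %in% c(21, 22.8)      # arbitrary illustrative values
length(cond)                    # 32: one TRUE/FALSE per element, not a single value

# Collapse to a single logical before handing it to if ()
has_match <- any(cond)          # TRUE if at least one element matches
all_match <- all(cond)          # TRUE only if every element matches
if (has_match) "some match" else "no match"
```

From R 4.2.0 onward, passing a condition of length greater than one to `if` is an error; older versions used the first element and issued a warning.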

    The two other limitations have to do with the fundamental design of the tables package, on which modelsummary::datasummary is based:

    1. Factors will always generate one row per factor level.
    2. I don’t think there is a good way to tell datasummary that a function should behave differently when applied to different numeric variables. This is because each function only sees the raw numeric vector, and not other meta-information.
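
A small base R sketch of that last point, rebuilding the `tmp` data from the question. Once both columns are overwritten with 0 they are indistinguishable by value, so any function that only sees `x` has to treat them the same way. This would explain why region picks up the class label.

```r
tmp <- mtcars[, c("mpg", "hp")]
tmp$class <- 0    # as in the question, after the factor is overwritten
tmp$region <- 0

# The summary function receives only the values, never the column name
identical(tmp$class, tmp$region)   # TRUE
all(tmp$region %in% tmp$class)     # TRUE, so the class branch also fires for region
```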

    I think the easiest workaround is to create two tables, one for your factors and one for your numeric variables. Then, these tables can easily be combined:

    library(modelsummary)
    
    N_factor <- function(x) {
      count <- table(x)
      pct <- prop.table(count)
      # prop.table() returns proportions; scale by 100 to report percentages
      out <- paste(sprintf("%.0f (%.1f)", count, 100 * pct), collapse = "; ")
      sprintf("[%s]", out)
    }
    
    N_numeric <- function(x) {
      sprintf("%s (100)", length(x))
    }
    
    tab_fac <- datasummary(cyl + gear ~ Heading("N") * N_factor, 
                           output = "data.frame",
                           data = mtcars)
    
    datasummary(mpg + hp ~ Heading("N") * N_numeric, 
                add_rows = tab_fac,
                data = mtcars)
    
             N
    mpg      32 (100)
    hp       32 (100)
    cyl      [11 (34.4); 7 (21.9); 14 (43.8)]
    gear     [15 (46.9); 12 (37.5); 5 (15.6)]
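
Because a summary function of this kind only receives a plain vector, it can be checked on its own before being placed in a datasummary formula. The sketch below uses base R only. `n_pct` is a hypothetical helper name rather than part of modelsummary, and the `region` vector is rebuilt from the group sizes given in the question (14, 6 and 12 rows).

```r
# n_pct is a hypothetical helper, shown only to illustrate testing outside datasummary
n_pct <- function(x) {
  count <- table(x)                               # units per level
  pct <- 100 * as.numeric(prop.table(count))      # share of each level, in percent
  out <- paste(sprintf("%d (%.1f)", as.integer(count), pct), collapse = "; ")
  sprintf("[%s]", out)
}

# Region sizes from the question: rows 1-14, 15-20 and 21-32
region <- rep(c(1, 2, 3), times = c(14, 6, 12))
n_pct(region)
# "[14 (43.8); 6 (18.8); 12 (37.5)]"
```

This reproduces the string hard-coded for region in the question, so the same helper could replace the manual calculation.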