rcrosstabcontingencyexpsssummarytools

R markdown contingency table %>% tabulate column variables with selective values


I am quite new to R, coming from Stata. Below is the r markdown chunk with reproducible data example. The data is representative to the data i am working with. But only with more binary (logical) and factor variables in number.

The libraries and data:

# Setup and load package:
library(dplyr)
library(expss)
library(hablar)
library(kableExtra)
library(summarytools)

# Load data:
data("mtcars")
raw_df <- select(mtcars,c(wt,cyl,gear,vs,am))

# Data prep and labelling:
df <- raw_df %>%
  apply_labels(wt = "Facility ID",
               cyl = "Geographical Area",
               cyl = c("Area A" = 4,"Area B" = 6, "Area C" = 8),
               gear = "Tier",
               gear = c("Tier 1" = 3, "Tier 2" = 4, "Tier 3" = 5),
               vs = "E.coli",
               am = "V.choleri") %>%
  convert(chr(wt),
          fct(cyl,gear),
          lgl(vs,am))

Please note that in my actual data there are more categorical and logical variables. I have managed to make the following table in r markdown (html output):


df %>%
  tab_cells(cyl, gear) %>%
  tab_total_row_position("below") %>%
  tab_total_statistic("u_rpct")%>%
  tab_total_label("Total hosts (Row proportions)") %>% 
  tab_cols(vs, am) %>% 
  tab_stat_rpct() %>% 
  tab_cols(total(label = "Number of hosts")) %>%  
  tab_stat_cases() %>%
  tab_pivot(stat_position = "outside_columns") %>%
  recode(as.criterion(is.numeric) & is.na ~ 0, TRUE ~ copy) %>% 
  split_table_to_df() %>% 
  kable(align = "c", digits = 1) %>% 
  kable_styling(bootstrap_options = c("striped", "condensed", "responsive"),
                full_width = F, position = "center") %>% 
  row_spec(1:2, bold = TRUE)

Problems: 1. I wish i could include only "TRUE" columns, dropping "FALSE" columns from the table. But keeping the 1st Row label intact ("E. coli", "V.choleri"). In fact i would not be needing the 2nd row ("TRUE","FALSE) 2. I have labelled the "Total Row proportion" (#Total hosts), But can not remove the leading "#" sign. In the right most column cell of the row with the "Total row proportion", it shows "100". I tried it to be the summation of column cells instead, but failed. "100" is totally misleading. 3. I have also tried to get my desired table through "ctable" function of "summarytools" package. As it has an excellent structure, with number of observations also induced within proportion cells. :

print(ctable(df$cyl,df$am), method = 'render')

But the problem is it seems to permit only one pair of categorical variables. And also, the "FALSE" can not be omited. But the last column is perfect with rowtotals (observations)

Details: R : 4.0.0 R studio: 1.2.5042 The packages are all up-to-date.


Solution

  • Tables from expss are usual data.frames. Column labels is just column names with rows separated with "|" symbol. So, you can manipulate them as usual column names. Row labels are located in the column row_labels and we can remove '#' sign with search and replace operations. "Total row proportion" shows "100" because at the beginning you specify total statistic as row percent and row percent for single column is 100. Taking into account all the above:

    library(dplyr)
    library(expss)
    library(hablar)
    library(kableExtra)
    library(summarytools)
    
    # Load data:
    data("mtcars")
    raw_df <- select(mtcars,c(wt,cyl,gear,vs,am))
    
    # Data prep and labelling:
    df <- raw_df %>%
        apply_labels(wt = "Facility ID",
                     cyl = "Geographical Area",
                     cyl = c("Area A" = 4,"Area B" = 6, "Area C" = 8),
                     gear = "Tier",
                     gear = c("Tier 1" = 3, "Tier 2" = 4, "Tier 3" = 5),
                     vs = "E.coli",
                     am = "V.choleri") %>%
        convert(chr(wt),
                fct(cyl,gear),
                lgl(vs,am))
    
    
    tbl = df %>%
        tab_cells(cyl, gear) %>%
        tab_total_row_position("below") %>%
        tab_total_statistic("u_rpct")%>%
        tab_total_label("Total hosts (Row proportions)") %>% 
        tab_cols(vs, am) %>% 
        tab_stat_rpct() %>% 
        tab_cols(total(label = "Number of hosts")) %>%  
        # specify total statistic for last column
        tab_stat_cases(total_statistic = "u_cases") %>%
        tab_pivot(stat_position = "outside_columns") %>%
        recode(as.criterion(is.numeric) & is.na ~ 0, TRUE ~ copy) %>% 
        # remove columns with FALSE
        except(contains("FALSE")) %>% 
        compute(
            # remove '#' sign from row labels
            row_labels = gsub("#", "", row_labels)
        )
    
    # remove '#' sign from column labels
    colnames(tbl) = gsub("\\|TRUE", "", colnames(tbl))
    
    tbl %>% 
        split_table_to_df() %>% 
        kable(align = "c", digits = 1) %>% 
        kable_styling(bootstrap_options = c("striped", "condensed", "responsive"),
                      full_width = F, position = "center") %>% 
        row_spec(1:2, bold = TRUE)