r, frequency, summarytools

Saving freq() object into data frame


I am looping over a few frequency tables built with the freq() function in summarytools and printing the results. In doing so, I noticed that when I save a freq() object created without missing values and convert it to a data frame, the total number of observations still includes the missing values.

# Create a vector with 10 observations of "smoker"
smoker <- c("yes", "no", "yes", NA, NA, NA, "yes", "no", "yes", "no")

# Create a DataFrame using the vector
df <- data.frame(smoker)

library(summarytools)
library(dplyr)

# Create a frequency table without missing values
freq(df$smoker, report.nas = FALSE)

# Try to save this table into a data frame
table <- as.data.frame(freq(df$smoker, report.nas = FALSE))
# OR, using the pipe:
table <- df %>% freq(smoker, report.nas = FALSE) %>% as.data.frame()
table

The results should look like this (missing values excluded, n=7):

          Freq        %   % Cum.
     no      3    42.86    42.86
    yes      4    57.14   100.00
  Total      7   100.00   100.00

But after converting it to a data.frame, it looks like this (missing values are back, with total n=10):

      Freq   % Valid % Valid Cum. % Total % Total Cum.
no       3  42.85714     42.85714      30           30
yes      4  57.14286    100.00000      40           70
<NA>     3        NA           NA      30          100
Total   10 100.00000    100.00000     100          100

This seems like a bug, but I am not sure whether it is the expected behavior. Any thoughts on how to save the printed output (without missing values) as a data.frame? I'm hoping to loop over the data frames and add kable styling.


Solution

  • report.nas only affects how the NA values are printed, not whether they are stored. If we store the freq object as see:

    see <- summarytools::freq(df$smoker, report.nas = FALSE)
    

    You can see it prints the values as desired:

    # Frequencies  
    # df$smoker  
    # Type: Character  
    # 
    #               Freq        %   % Cum.
    # ----------- ------ -------- --------
    #          no      3    42.86    42.86
    #         yes      4    57.14   100.00
    #       Total      7   100.00   100.00
    

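    But the stored object still contains the NA row and totals computed over all 10 observations, which is the table shown in the question after conversion to a data frame.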

    So you will still need to subset to get what you want. This approach simply uses !is.na() on the percent valid column (column 2) to drop the NA row:

    want <- as.data.frame(see[!is.na(see[,2]),])
    
    #       Freq   % Valid % Valid Cum. % Total % Total Cum.
    # no       3  42.85714     42.85714      30           30
    # yes      4  57.14286    100.00000      40           70
    # Total   10 100.00000    100.00000     100          100
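
Note that the subset above still reports a Total of 10 and keeps the % Total columns, so it does not yet match the printed table (n=7). Below is a minimal sketch of one way to finish the job. `valid_only()` is a hypothetical helper, not part of summarytools, and it assumes the stored table uses the row and column names shown in the question ("Freq", "% Valid", "% Valid Cum.", "Total"). The `stored` matrix is hand-built from the values in the question so the sketch runs without summarytools installed; with the package loaded you would pass `see` instead.

```r
# Hypothetical helper: keep only the valid rows and columns of a stored
# freq() table and recompute the Total count from the remaining rows.
valid_only <- function(x) {
  keep_cols <- c("Freq", "% Valid", "% Valid Cum.")
  # Drop the NA row (its "% Valid" is NA) and the "% Total" columns
  m <- x[!is.na(x[, "% Valid"]), keep_cols, drop = FALSE]
  out <- as.data.frame(m)
  # Replace the stored Total (which counts NAs) with the sum of valid rows
  is_total <- rownames(out) == "Total"
  out[is_total, "Freq"] <- sum(out[!is_total, "Freq"])
  out
}

# Stand-in for the stored freq() table, copied from the question's output
# (filled column by column)
stored <- matrix(
  c(3, 4, 3, 10,
    42.85714, 57.14286, NA, 100,
    42.85714, 100, NA, 100,
    30, 40, 30, 100,
    30, 70, 100, 100),
  nrow = 4,
  dimnames = list(
    c("no", "yes", "<NA>", "Total"),
    c("Freq", "% Valid", "% Valid Cum.", "% Total", "% Total Cum.")
  )
)

want <- valid_only(stored)
print(want)
```

From here, something like `knitr::kable(want, digits = 2)` should produce a table that can be styled, and the helper can be called once per variable inside a loop.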
    
