rdataframefunctionfiltersubset

How to get column-wise summary statistics with missing codes?


I have written a custom function ord_table() to extract summary statistics from a series of databases. To get those summary statistics, I have to filter out missing data codes (all codes are large negative numbers).

> data
     var1 var2   var3   var4
1       3    2      2      1
2  -99999    3      1      1
3       3    5 -99999      2
4       2    3      3      3
5       2    3      4      4
6  -99999    1 -99999      5
7       3    4      1      5
8       5    1      3      3
9       4    1      5 -99999
10 -99999    5      4      1
11 -99999    3      2      2
12      1    2      5      5

The column names are in a vector ord_vars.

I am applying the function with lapply.

ordinal <- lapply(ord_vars, ord_table)

Here is my function:

ord_table <- function(x) {
  table_ord <- data.frame(Median=double(length=1), 
                          Minimum=double(length=1), 
                          Maximum=double(length=1), 
                          IQR=double(length=1))
  table_ord$Median <- data %>% subset(x > 0) %>% pull(x) %>% median()
  table_ord$Minimum <- data %>% subset(x > 0) %>% pull(x) %>% min()
  table_ord$Maximum <- data %>% subset(x > 0) %>% pull(x) %>% max()
  table_ord$IQR <- data %>% subset(x > 0) %>% pull(x) %>% IQR() 
  return(table_ord)
}

It is all working except for the subsetting to remove the negative values that code for missing data.

I have tried filter() and subset(), and both with unquote(x) instead of x.

For each dataframe, I get results including missing data codes:

[[1]]
  Median Minimum Maximum      IQR
1      2  -99999       5 100002.2

Where I want to get:

[[1]]
  Median Minimum Maximum  IQR
1    2.5       2       5 2.25

data

set.seed(123)  ## for sake of reproducibility
a <- c(1, 2, 3, 4, 5, -99999)
var1 <- sample(a, 12, replace=TRUE)
var2 <- sample(a, 12, replace=TRUE)
var3 <- sample(a, 12, replace=TRUE)
var4 <- sample(a, 12, replace=TRUE)
data <- cbind(var1, var2, var3, var4) %>% as.data.frame()

ord_vars <- data %>% colnames() %>% as.vector() 

Solution

  • If you use filter, you can create a symbol from you string, then defuse with {{, so:

    ord_table <- function (x) {
      table_ord <- data.frame(Median = double(length = 1),
                              Minimum = double(length = 1),
                              Maximum = double(length = 1),
                              IQR = double(length = 1)
      )
      x <- sym(x)
      table_ord$Median <- data %>% filter({{x}} > 0) %>% pull(x) %>% median()
      table_ord$Minimum <- data %>% filter({{x}} > 0) %>% pull(x) %>% min()
      table_ord$Maximum <- data %>% filter({{x}} > 0) %>% pull(x) %>% max()
      table_ord$IQR <- data %>% filter({{x}} > 0) %>% pull(x) %>% IQR() 
      return(table_ord) 
    }
    

    But I would probably do something like this, and actually set the values as NA (ideally you would do this when you read in the data):

    library(tidyverse)
    data |> 
      mutate(across(test_vars, \(v) na_if(v, -99999))) |> 
      pivot_longer(cols = test_vars) |> 
      drop_na() |> 
      summarise(
        mean = mean(value), min = min(value), max = max(value), IQR = IQR(value),
        .by = name
      )
    
    # A tibble: 4 × 5
      name   mean   min   max   IQR
      <chr> <dbl> <dbl> <dbl> <dbl>
    1 var3   3.2      1     5   2  
    2 var4   2.58     1     5   1.5
    3 var2   2.44     1     5   2  
    4 var1   3.33     1     5   1