rdataframematchgrepl

How to check for existence of a set of columns in large dataframes?


I am constructing repeated cross-sectional (a panel) from multiple years of survey data. The survey data for each year is published as a separate dataframe with inconsistencies in variable names (column names) and sometimes variables are missing entirely. There are 450 columns so manual previewing is out of the question. As such, I want to search each dataframe to check for the existence of a list of variables of interest.

I first tried dplyr::select() but this is insufficient because it walks through the column names in order and only produces a warning for the first column name that can't be found in the dataframe when there may be in fact mutliple:

df <- data.frame(c1 = 1:5,
                  c2 = letters[1:5],
                  c3 = letters[1:5],
                  c4 = letters[1:5],
                  c5 = letters[1:5],
                  c6 = letters[1:5])
> 
> select(df, c(c1, c2, c7, c10, c11))
Error in `select()`:
! Can't subset columns that don't exist.
āœ– Column `c7` doesn't exist.
Run `rlang::last_trace()` to see where the error occurred.

The answer to this similar question (grep() with multiple column names in data frame) I also find unsatisfactory.

which(!is.na(match(colnames(df), c("c1", "c2", "c7", "c10", "c11"))))
[1] 1 2

The position of the matching column names is returned (not the names) but it is the non-matching column names which are the most important for us to know, the real problem has 10 or more variables of interest so a human friendly output is important.

Does anyone know of a better way? Perhaps a function that would produce a table of matches (TRUE/FALSE) for each supplied column name?


Solution

  • You can use %in% with ! to get a logical vector of columns that are not present in the data set:

    cols <- c("c1", "c2", "c7", "c10", "c11")
    !cols %in% colnames(df)
    #[1] FALSE FALSE  TRUE  TRUE  TRUE
    

    Or with setdiff to get the names:

    setdiff(cols, colnames(df))
    #[1] "c7"  "c10" "c11"
    

    And to select directly the columns in the data.frame without error messages, use any_of:

    library(dplyr)
    select(df, any_of(cols))
    #   c1 c2
    # 1  1  a
    # 2  2  b
    # 3  3  c
    # 4  4  d
    # 5  5  e