I am constructing a repeated cross-section (a panel) from multiple years of survey data. Each year's survey is published as a separate dataframe, with inconsistencies in variable names (column names), and sometimes variables are missing entirely. There are 450 columns, so manual previewing is out of the question. I therefore want to search each dataframe to check for the existence of a list of variables of interest.
I first tried dplyr::select(), but this is insufficient because it walks through the column names in order and errors on the first column name that can't be found in the dataframe, when there may in fact be multiple:
df <- data.frame(c1 = 1:5,
c2 = letters[1:5],
c3 = letters[1:5],
c4 = letters[1:5],
c5 = letters[1:5],
c6 = letters[1:5])
>
> select(df, c(c1, c2, c7, c10, c11))
Error in `select()`:
! Can't subset columns that don't exist.
✖ Column `c7` doesn't exist.
Run `rlang::last_trace()` to see where the error occurred.
I also find the answer to this similar question (grep() with multiple column names in data frame) unsatisfactory:
which(!is.na(match(colnames(df), c("c1", "c2", "c7", "c10", "c11"))))
[1] 1 2
The positions of the matching column names are returned (not the names), but it is the non-matching column names that are most important for us to know. The real problem has 10 or more variables of interest, so human-friendly output is important.
Does anyone know of a better way? Perhaps a function that would produce a table of matches (TRUE/FALSE) for each supplied column name?
You can use %in% with ! to get a logical vector of the columns that are not present in the data set:
cols <- c("c1", "c2", "c7", "c10", "c11")
!cols %in% colnames(df)
#[1] FALSE FALSE TRUE TRUE TRUE
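If you want the human-friendly TRUE/FALSE table asked for in the question, one option (a small sketch using the same df and cols as above) is to name the logical vector with the candidate columns via setNames:

```r
# Same example data frame and candidate columns as above
df <- data.frame(c1 = 1:5, c2 = letters[1:5], c3 = letters[1:5],
                 c4 = letters[1:5], c5 = letters[1:5], c6 = letters[1:5])
cols <- c("c1", "c2", "c7", "c10", "c11")

# Named logical vector: TRUE where the candidate column exists in df
setNames(cols %in% colnames(df), cols)
#   c1   c2    c7   c10   c11
# TRUE TRUE FALSE FALSE FALSE
```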
Or use setdiff to get the names:
setdiff(cols, colnames(df))
#[1] "c7" "c10" "c11"
And to select the columns directly from the data.frame without errors, use any_of:
library(dplyr)
select(df, any_of(cols))
# c1 c2
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
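If the end goal is the stacked repeated cross-section, any_of() combines naturally with bind_rows(), which fills variables missing in a given year with NA. A sketch under the assumption that your yearly extracts sit in a named list (surveys and the y2019/y2020 entries are hypothetical):

```r
library(dplyr)

cols <- c("c1", "c2", "c7")

# Hypothetical yearly survey extracts
surveys <- list(
  y2019 = data.frame(c1 = 1:2, c2 = letters[1:2], c7 = c(10, 20)),
  y2020 = data.frame(c1 = 3:4, c2 = letters[3:4])  # c7 missing this year
)

# Keep only the available variables of interest per year, then stack;
# columns absent in a year are filled with NA
panel <- bind_rows(lapply(surveys, select, any_of(cols)), .id = "year")
panel
#    year c1 c2 c7
# 1 y2019  1  a 10
# 2 y2019  2  b 20
# 3 y2020  3  c NA
# 4 y2020  4  d NA
```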