r dataframe data-integrity

Make R error out when accessing undefined columns in a data frame


This site has lots of questions on how to fix an "undefined column" error.

I have the exact opposite question: how to make an "undefined column" error.

I frequently change variable names in my files.

This leads to silent bugs like the following:

r$> df <- data.frame(gender=c(1,1,NA,0))
r$> sum(is.na(df$male))
[1] 0

when the correct result is 1.

I want R to print an error message if the column I'm trying to access is undefined.

Not to silently fail.

How can I do that?


Solution

  • Unfortunately R is rather too lenient when it comes to such matters. The $ operator for data.frames is defined to allow accessing non-existent columns and to return NULL in that case.
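
    To make the behaviour above concrete, here is a small base-R sketch (the variable names missing_col and strict are only illustrative). It shows why the sum comes out as 0, and that matrix-style indexing by name is one built-in access form that does raise an error:

    ```r
    df <- data.frame(gender = c(1, 1, NA, 0))

    # `$` on a missing column yields NULL; is.na(NULL) has length 0,
    # so summing it gives 0 instead of signalling a problem.
    missing_col <- df$male
    is.null(missing_col)        # TRUE
    sum(is.na(missing_col))     # 0

    # Matrix-style indexing by column name is strict and raises an error.
    strict <- tryCatch(
      df[, "male"],
      error = function(e) conditionMessage(e)
    )
    strict                      # the error message as a string
    ```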

    There are alternative data.frame implementations which are a bit stricter. Notably, the tbl_df data structure used by the Tidyverse packages ‘tibble’, ‘dplyr’, etc. will at least show you a warning:

    df <- tibble::tibble(gender = c(1, 1, NA, 0))
    sum(is.na(df$male))
    # [1] 0
    # Warning message:
    # Unknown or uninitialised column: `male`.
    

    Alternatively, you can make this a hard error by overriding $ for data.frames:

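    # The \(x, name) lambda syntax requires R >= 4.1.
    registerS3method(
      '$', 'data.frame',
      \(x, name) {
        # Fail loudly unless `name` is an exact column name.
        stopifnot(name %in% colnames(x))
        NextMethod('$')
      }
    )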
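
    As a quick check of what such an override does in practice, the sketch below registers a strict $ method for plain data.frames and tries three accesses. The result names (n_missing, unknown, partial) are only illustrative. Note one side effect: partial matching such as df$gen, which base R normally accepts, is rejected as well.

    ```r
    # Register a strict `$` for plain data.frames (session-wide effect).
    registerS3method("$", "data.frame", function(x, name) {
      stopifnot(name %in% colnames(x))
      NextMethod("$")
    })

    df <- data.frame(gender = c(1, 1, NA, 0))

    # Exact column name: behaves as before.
    n_missing <- sum(is.na(df$gender))

    # Unknown column: now an error instead of a silent NULL.
    unknown <- tryCatch(df$male, error = function(e) "error")

    # Partial match: rejected too, because the check demands an exact name.
    partial <- tryCatch(df$gen, error = function(e) "error")
    ```

    Because the method is registered for the whole session, any other code that relies on $ returning NULL for an absent column would start failing too, so this is probably best limited to interactive or analysis scripts.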
    

    However, note that this will only apply to plain data.frames, not to tibbles, since the latter also override $. There does not seem to be an option to make this a hard error for tibbles (short of turning all warnings into errors); this might be a nice feature request for the package. Alternatively, you can make the above code apply to tibbles by replacing 'data.frame' with 'tbl_df'.
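
    For the "all warnings into errors" route, options(warn = 2) does this globally. A narrower variant is sketched below: warnings_as_errors is a hypothetical helper (not part of tibble or base R) that promotes warnings to errors only for the expression it wraps. The demonstration uses a base-R warning so that it runs without extra packages; the commented-out lines show how it would presumably be used with a tibble.

    ```r
    # Hypothetical helper: evaluate `expr`, turning any warning into an error.
    warnings_as_errors <- function(expr) {
      withCallingHandlers(
        expr,
        warning = function(w) stop(conditionMessage(w), call. = FALSE)
      )
    }

    # Intended use with a tibble (assumes the tibble package is installed):
    # df <- tibble::tibble(gender = c(1, 1, NA, 0))
    # warnings_as_errors(sum(is.na(df$male)))

    # Package-free demonstration: as.numeric() warns on unparseable input,
    # and the helper escalates that warning to an error.
    result <- tryCatch(
      warnings_as_errors(as.numeric("not a number")),
      error = function(e) "error"
    )
    ```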