r, dplyr, tidyverse, rlang, tidyselect

How to resolve "no visible binding for global variable" for a column name in a dplyr tidy-select pipeline?


The following code runs fine, but it fails a lintr check and raises a note from devtools::check(), which is preventing me from submitting my package to CRAN.

# MINIMAL REPRODUCIBLE EXAMPLE OF TIDYSELECT COLUMN SELECTION LINTR ERROR

# Create some dummy sample data
data <- tibble::tibble(
  name = c("John", "Jane", "Jim", "Jill")
)

example_function <- function(
  data
) {
  data |>
    dplyr::mutate(.row_id = seq_len(dplyr::n())) |>
    dplyr::select(.row_id, name) |>
    dplyr::filter(.row_id == 1)
}

print(example_function(data))

> source("/home/chriscarrollsmith/Documents/Software/r-econid/R/experiment.R", encoding = "UTF-8")
# A tibble: 1 × 2
  .row_id name 
    <int> <chr>
1       1 John 
> lintr::lint_package()
R/experiment.R:13:19: warning: [object_usage_linter] no visible binding for global variable '.row_id'
    dplyr::select(.row_id, name) |>
                  ^~~~~~~
R/experiment.R:13:19: warning: [object_usage_linter] no visible binding for global variable '.row_id'
    dplyr::select(.row_id, name) |>
                  ^~~~~~~
R/experiment.R:13:28: warning: [object_usage_linter] no visible binding for global variable 'name'
    dplyr::select(.row_id, name) |>
                           ^~~~

This only happens inside a function. If I perform the operation in the main body of an R script, no linting errors are raised. I assume it's basically because R has no "type hinting" or "type definition" system, so the linter doesn't know what columns to expect from data. (Although you'd think it could infer the existence of .row_id since we added this inside the function body!)

What I've tried:

  1. It's not a symbol conflict. .row_id isn't used anywhere else in the package or by any of the dependencies, and the error still occurs if I change the column name to anything else.

  2. Using the .data pronoun from rlang, as suggested here, fixes the linter error, but raises the following warning:

Warning message:
Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0.
ℹ Please use `".row_id"` instead of `.data$.row_id`

  3. Using quote marks, as suggested in the deprecation warning, works in select but does not work in filter, where we're using the column name in a mathematical expression. In that case, the expression compares the literal string ".row_id" to the integer 1, and thus always evaluates to FALSE.

  4. Similarly, using a dplyr selection helper like dplyr::all_of works in select but not in filter. In filter, it not only always evaluates to FALSE, but also raises this warning:

Warning message:
There was 1 warning in `dplyr::filter()`.
ℹ In argument: `dplyr::all_of(c(".row_id")) == 1`.
Caused by warning:
! Using `all_of()` outside of a selecting function was deprecated in tidyselect 1.2.0.
ℹ See details at ?tidyselect::faq-selection-context

My questions are: why does this happen, and what is the right way to invoke column names in a dplyr pipeline inside a function without raising any notes, warnings, or errors?


Solution

  • After a little investigation, I concluded that I was confusing two different things.

    select takes <tidy-select> arguments, and the .data pronoun is deprecated here.

    filter takes <data-masking> arguments, and the .data pronoun is not deprecated here.

    So the answer is that you should use quote marks in select, but you should use .data in filter. Thus:

    # MINIMAL REPRODUCIBLE EXAMPLE OF TIDYSELECT COLUMN SELECTION LINTR ERROR
    
    library(dplyr)
    
    # Create some dummy sample data
    data <- tibble::tibble(
      name = c("John", "Jane", "Jim", "Jill")
    )
    
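    example_function <- function(
      data
    ) {
      data |>
        dplyr::mutate(.row_id = seq_len(dplyr::n())) |>
        # select() is a tidy-select context: pass column names as strings
        dplyr::select(".row_id", "name") |>
        # filter() is a data-masking context: use the .data pronoun
        dplyr::filter(.data$.row_id == 1)
    }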
    
    print(example_function(data))
    

    This will evaluate correctly and will not raise any notes or warnings. Note that .data has to be imported as a separate step: attach dplyr with library(dplyr) in a script, or add #' @importFrom dplyr .data to the roxygen block if this is a package function. Qualifying it as dplyr::.data$.row_id does not work and raises the following unhelpful error:

    Error in `dplyr::filter()` at r-econid/R/experiment.R:11:3:
    ℹ In argument: `dplyr::.data$.row_id == 1`.
    Caused by error in `example_function()`:
    ! Can't subset `.data` outside of a data mask context.
    Run `dplyr::last_trace()` to see where the error occurred.
    

    Don't be misled by the error's implication that filter is not a data mask context; it is. The problem is how .data is accessed: dplyr::.data refers to the object exported from the dplyr namespace, not the pronoun that filter supplies inside its data mask.
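
    A minimal sketch of the difference, assuming dplyr and tibble are installed. The data frame df and the function names keep_first_ok and keep_first_bad are made up for illustration and are not part of the original package:

    ```r
    library(dplyr)

    # Hypothetical sample data, not from the original post
    df <- tibble::tibble(.row_id = 1:3, name = c("John", "Jane", "Jim"))

    # Bare .data is resolved inside filter()'s data mask, so this works
    keep_first_ok <- function(data) {
      dplyr::filter(data, .data$.row_id == 1)
    }

    # dplyr::.data is looked up in the dplyr namespace instead, so
    # subsetting it fails even though filter() is a data-masking function
    keep_first_bad <- function(data) {
      dplyr::filter(data, dplyr::.data$.row_id == 1)
    }

    ok <- keep_first_ok(df)

    # Capture the error as a condition object instead of stopping the script
    bad <- tryCatch(keep_first_bad(df), error = function(e) e)

    print(ok)
    print(inherits(bad, "error"))
    ```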

    Other data-masking functions include group_by and summarize. Here you should use .data, not quotes.
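
    For example, here is a rough sketch of the same pattern in group_by and summarise, assuming dplyr and tibble are installed. The data frame df, its columns group and value, and the function name summarise_by_group are invented for illustration:

    ```r
    library(dplyr)

    # Hypothetical sample data, not from the original post
    df <- tibble::tibble(
      group = c("a", "a", "b"),
      value = c(1, 2, 5)
    )

    summarise_by_group <- function(data) {
      data |>
        # group_by() is data-masking: refer to the column through .data
        dplyr::group_by(.data$group) |>
        # summarise() is data-masking too
        dplyr::summarise(total = sum(.data$value), .groups = "drop")
    }

    result <- summarise_by_group(df)
    print(result)
    ```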

    As for why these linter warnings happen, I haven't fully figured that out beyond the speculation in the question. But I thought what I learned about the difference between <tidy-select> and <data-masking> was instructive enough to share and record for posterity here.
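
    One plausible explanation, offered as a sketch and not a definitive account: lintr's object_usage_linter and R CMD check both rely on static analysis from the codetools package, which inspects a function body without running it. Because dplyr resolves column names only when the pipeline is evaluated against the data frame, a static pass cannot tell that .row_id and name are columns, so it treats them as global variables. The snippet below assumes codetools is installed (it ships with standard R distributions) and uses codetools::findGlobals to list the symbols the analysis considers global:

    ```r
    # Same function as in the question; it is only analysed here, never called,
    # so dplyr does not need to be attached for this snippet
    example_function <- function(data) {
      data |>
        dplyr::mutate(.row_id = seq_len(dplyr::n())) |>
        dplyr::select(.row_id, name) |>
        dplyr::filter(.row_id == 1)
    }

    # Split the result into global functions and global variables
    globals <- codetools::findGlobals(example_function, merge = FALSE)

    # Symbols that static analysis could not resolve to a local variable
    print(globals$variables)
    ```

    With quoted names in select and .data in filter, the bare column symbols disappear from the function body, which would explain why the warnings go away.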