The following code runs fine, but it fails a lintr check and raises a note from devtools::check(), which is preventing me from submitting my package to CRAN.
# MINIMAL REPRODUCIBLE EXAMPLE OF TIDYSELECT COLUMN SELECTION LINTR ERROR
# Create some dummy sample data
data <- tibble::tibble(
  name = c("John", "Jane", "Jim", "Jill")
)

example_function <- function(
  data
) {
  data |>
    dplyr::mutate(.row_id = seq_len(dplyr::n())) |>
    dplyr::select(.row_id, name) |>
    dplyr::filter(.row_id == 1)
}

print(example_function(data))
> source("/home/chriscarrollsmith/Documents/Software/r-econid/R/experiment.R", encoding = "UTF-8")
# A tibble: 1 × 2
  .row_id name
    <int> <chr>
1       1 John
> lintr::lint_package()
R/experiment.R:13:19: warning: [object_usage_linter] no visible binding for global variable '.row_id'
    dplyr::select(.row_id, name) |>
                  ^~~~~~~
R/experiment.R:13:19: warning: [object_usage_linter] no visible binding for global variable '.row_id'
    dplyr::select(.row_id, name) |>
                  ^~~~~~~
R/experiment.R:13:28: warning: [object_usage_linter] no visible binding for global variable 'name'
    dplyr::select(.row_id, name) |>
                           ^~~~
This only happens inside a function. If I perform the operation in the main body of an R script, no linting errors are raised. I assume it's basically because R has no "type hinting" or "type definition" system, so the linter doesn't know what columns to expect from data. (Although you'd think it could infer the existence of .row_id since we added it inside the function body!)
What I've tried:
It's not a symbol conflict. .row_id isn't used anywhere else in the package or by any of the dependencies, and the error still occurs if I change the column name to anything else.
Using the .data pronoun from rlang, as suggested here, fixes the linter error, but raises the following warning:
Warning message:
Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0.
ℹ Please use `".row_id"` instead of `.data$.row_id`
Using quote marks, as suggested in the deprecation warning, works in select but does not work in filter, where the column name is used inside a comparison. There, the expression compares the literal string ".row_id" to the number 1, and thus always evaluates to FALSE.
Similarly, using a dplyr selection helper like dplyr::all_of works in select but not in filter. In filter, it not only always evaluates to FALSE, but also raises this warning:
Warning message:
There was 1 warning in `dplyr::filter()`.
ℹ In argument: `dplyr::all_of(c(".row_id")) == 1`.
Caused by warning:
! Using `all_of()` outside of a selecting function was deprecated in tidyselect 1.2.0.
ℹ See details at ?tidyselect::faq-selection-context
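To make the quoted-name behaviour concrete, here is a small sketch that isolates it. The data frame df and the result names selected and filtered are made up for illustration; it assumes dplyr and tibble are installed and R is at least 4.1 (for the native pipe).

```r
library(dplyr)

# Illustrative data, mirroring the example above
df <- tibble::tibble(name = c("John", "Jane", "Jim", "Jill")) |>
  mutate(.row_id = seq_len(n()))

# In select(), a string is treated as a column name, so this keeps both columns.
selected <- select(df, ".row_id", "name")

# In filter(), a string is just a string: this compares the literal
# ".row_id" with 1, which is FALSE, so no rows are kept.
filtered <- filter(df, ".row_id" == 1)

nrow(selected)
nrow(filtered)
```

The same string therefore means "the column called .row_id" in one verb and "the text .row_id" in the other, which is why a single quoting strategy does not work across the whole pipeline.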
My questions are: why does this happen, and what is the right way to invoke column names in a dplyr pipeline inside a function without raising any notes, warnings, or errors?
After a little investigation, I concluded that I was confusing two different things.
select takes <tidy-select> arguments, and the .data pronoun is deprecated here.
filter takes <data-masking> arguments, and the .data pronoun is not deprecated here.
So the answer is that you should use quote marks in select, but you should use .data in filter. Thus:
# MINIMAL REPRODUCIBLE EXAMPLE OF TIDYSELECT COLUMN SELECTION LINTR ERROR
library(dplyr)

# Create some dummy sample data
data <- tibble::tibble(
  name = c("John", "Jane", "Jim", "Jill")
)

example_function <- function(
  data
) {
  data |>
    dplyr::mutate(.row_id = seq_len(dplyr::n())) |>
    dplyr::select(".row_id", "name") |>
    dplyr::filter(.data$.row_id == 1)
}

print(example_function(data))
This will evaluate correctly and will not raise any notes or warnings. Note that you have to bring .data into scope as a separate step, using either library(dplyr) or #' @importFrom dplyr .data (if this were a package function). Trying to access it as dplyr::.data$.row_id will raise the following unhelpful error:
Error in `dplyr::filter()` at r-econid/R/experiment.R:11:3:
ℹ In argument: `dplyr::.data$.row_id == 1`.
Caused by error in `example_function()`:
! Can't subset `.data` outside of a data mask context.
Run `dplyr::last_trace()` to see where the error occurred.
Don't be misled by the error's implication that filter is not a data mask context; it is. The problem here is with the way .data is being accessed: from the dplyr namespace rather than from the filter context.
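For the package case, a minimal sketch of what the roxygen import might look like is below. The function name first_row and its documentation text are hypothetical; the only parts taken from above are the @importFrom dplyr .data tag and the pipeline itself. Note that .data is written bare inside filter, never as dplyr::.data.

```r
#' Keep the first row of a data frame (hypothetical helper for illustration)
#'
#' @param data A data frame with a `name` column.
#' @importFrom dplyr .data
#' @export
first_row <- function(data) {
  data |>
    dplyr::mutate(.row_id = seq_len(dplyr::n())) |>
    # tidy-select context: quoted column names
    dplyr::select(".row_id", "name") |>
    # data-masking context: bare .data pronoun, not dplyr::.data
    dplyr::filter(.data$.row_id == 1)
}

# Example call with made-up data
out <- first_row(tibble::tibble(name = c("John", "Jane")))
```

After adding the tag, the NAMESPACE file has to be regenerated (for example with devtools::document()) for the import to take effect.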
Other data-masking functions include group_by and summarize. Here you should use .data, not quotes.
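As a rough illustration of the same pattern with those two verbs, here is a sketch using an invented sales data frame and a hypothetical total_by_region function; neither comes from the package in question.

```r
library(dplyr)

# Made-up data for illustration only
sales <- tibble::tibble(
  region = c("north", "north", "south"),
  amount = c(10, 20, 5)
)

total_by_region <- function(data) {
  data |>
    # group_by() and summarize() are data-masking, so columns are
    # referenced through the .data pronoun rather than as strings
    dplyr::group_by(.data$region) |>
    dplyr::summarize(total = sum(.data$amount))
}

result <- total_by_region(sales)
```

Writing dplyr::group_by("region") instead would group by a constant string rather than by the region column, which is the same trap as the quoted name in filter.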
As for why these linter errors happen, I haven't fully figured that out beyond the speculation in the question. But I thought what I learned about the difference between <tidy-select> and <data-masking> was instructive enough to share and record for posterity here.