Is there a way to tell data.table to look for an external variable instead of a column name, just like what you can do with the .env pronoun in dplyr?
Imagine you have a data frame with a column and a variable of the same name: how do you distinguish between them?
Have a look at the following example:
animalDf <- data.frame(
animal = c("snail", "spider", "bear"),
legs = c(0, 8, 4)
)
animal <- "spider"
animalDf |>
dplyr::filter(.data$animal == .env$animal) |>
dplyr::pull(legs)
# I get the correct result: 8
animalDt <- data.table::as.data.table(animalDf)
animalDt[animal == animal, legs] # does not work: both sides resolve to the column, so every row matches
In functions I might not be able to control all column names of the data.table, so it is important to be able to state explicitly that the variable from the calling environment should be used.
env
We can use env to dynamically subset. This was introduced in data.table v1.15.0 (Jan 2024) and is described in the Programming on data.table vignette.
Note that because in this case we want to provide the actual character value, i.e. "spider", rather than a column called spider, we wrap it in the I() function. As the docs note:
The I function marks an object as AsIs, preventing its arguments from character-to-symbol automatic conversion.
This is only required for character values - see this similar question with a numeric column, where I() is not required.
animalDt[
animal == animal_var,
legs,
env = list(animal_var = I(animal))
]
# [1] 8
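For comparison, here is a minimal sketch of the numeric case, where the value can be passed to env without I(). The table is rebuilt so the snippet runs on its own; legs_var is a placeholder name chosen purely for illustration, and data.table v1.15.0 or later is assumed.

```r
library(data.table)  # assumes data.table >= 1.15.0 for the env argument

animalDt <- data.table(
  animal = c("snail", "spider", "bear"),
  legs = c(0, 8, 4)
)

# A variable that deliberately shares its name with the legs column
legs <- 8

# legs_var is a placeholder; env substitutes the value of the outer `legs` for it.
# Numeric values are inserted as they are, so I() is not needed here.
spider_name <- animalDt[legs == legs_var, animal, env = list(legs_var = legs)]
print(spider_name)
```

Inside the brackets, legs still refers to the column, while the value of the outer variable arrives only through env.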
.. prefix
Alternatively, in this instance, you can use the .. prefix to refer to objects in the parent environment. As the data.table vignettes note:
For those familiar with the Unix terminal, the .. prefix should be reminiscent of the “up-one-level” command, which is analogous to what’s happening here – the .. signals to data.table to look for the select_cols variable “up-one-level”, i.e., within the global environment in this case.
animalDt[, legs[animal == ..animal]]
# [1] 8
I think the vignette is actually a little conservative, as .. can access variables which are more than one level up if necessary; otherwise the following would not work:
f <- function(dt) {
g <- function(dt) dt[, legs[animal == ..animal]]
g(dt)
}
f(animalDt)
# [1] 8
This is not a good way to write a function (animal should be a parameter) but .. is under the hood doing get0("animal", parent.frame()). This means it will be able to access animal if it exists in environments enclosing the parent frame, such as the global environment.
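The scoping behaviour can be illustrated in base R alone. This is only a sketch of the lookup rule, assuming .. behaves like the get0() call described above; lookup() is a hypothetical stand-in for the data.table call and is not part of any package.

```r
animal <- "spider"  # defined only in the global environment

f <- function() {
  g <- function() {
    # lookup() is a hypothetical stand-in for the dt[...] call:
    # it searches from its caller's frame, i.e. the execution environment of g
    lookup <- function() get0("animal", envir = parent.frame())
    lookup()
  }
  g()
}

# get0() inherits by default, so the search continues from g's environment
# to f's environment and finally to the global environment
result <- f()
print(result)
```

Because get0() searches enclosing environments by default, the value is found even though neither f nor g defines animal.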
However, note that we are subsetting the legs column, i.e. making a copy at certain indices, which with a very large data.table could be slow. This is because we can only use .. in j but not in i, i.e. this does not work:
animalDt[animal == ..animal, legs]
# Error in eval(stub[[3L]], x, enclos) : object '..animal' not found
Personally, I find .. more readable, and if performance is not a large concern I would use it. However, for a more generalisable and performant approach, env is the way to go.
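To tie this back to the original concern about functions, here is a minimal sketch of env inside a function whose argument shares its name with a column. legs_of and target are hypothetical names used for illustration only, and data.table v1.15.0 or later is assumed.

```r
library(data.table)  # assumes data.table >= 1.15.0 for the env argument

animalDt <- data.table(
  animal = c("snail", "spider", "bear"),
  legs = c(0, 8, 4)
)

# Hypothetical helper: the `animal` argument clashes with the `animal` column
legs_of <- function(dt, animal) {
  # `target` is a placeholder that env replaces with the argument's value;
  # I() keeps the string as a value rather than converting it to a column name
  dt[animal == target, legs, env = list(target = I(animal))]
}

bear_legs <- legs_of(animalDt, "bear")
print(bear_legs)
```

The env list is built in the function's own frame, so animal there refers to the argument, while animal inside the brackets still refers to the column.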
It is also possible to use get(), mget() or eval() here instead (as is done in the accepted answer to the similar question) but, as Friede states in the comments, these approaches have now been retired in data.table in favour of env.