r, data.table

Is there an equivalent of dplyr data pronouns in data.table?


Is there a way to tell data.table to look for an external variable instead of a column name, just like you can with the .env pronoun in dplyr? Imagine you have a data frame with a column and a variable in the calling environment that share the same name: how do you distinguish between them? Have a look at the following example:

animalDf <- data.frame(
  animal = c("snail", "spider", "bear"),
  legs = c(0, 8, 4)
)
animal <- "spider"
animalDf |> 
  dplyr::filter(.data$animal == .env$animal) |> 
  dplyr::pull(legs)
# I get the correct result: 8
animalDt <- data.table::as.data.table(animalDf)
animalDt[animal == animal, legs] # does not work: both sides resolve to the column, so every row matches

In functions I might not be able to control all the column names of the data.table, so it is important to be able to state explicitly that the variable from the calling environment should be used.


Solution

  • Using env

    We can use env to dynamically subset. This was introduced in data.table v1.15.0 (Jan 2024) and is described in the Programming on data.table vignette.

    Note that because in this case we want to provide the actual character value, i.e. "spider", rather than a column called spider, we wrap it in the I() function. As the docs note:

    The I function marks an object as AsIs, preventing its arguments from character-to-symbol automatic conversion.

    This is only required for character values - see this similar question with a numeric column, where I() is not required.

    animalDt[
        animal == animal_var,
        legs,
        env = list(animal_var = I(animal))
    ]
    # [1] 8
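
    To make the role of I() more concrete, here is a minimal sketch of both conversions side by side. The placeholder names out_col, target and n_legs are made up for illustration, and the example assumes data.table 1.15.0 or later is installed.

    ```r
    library(data.table)  # the env argument needs data.table >= 1.15.0

    animalDt <- data.table(
        animal = c("snail", "spider", "bear"),
        legs = c(0, 8, 4)
    )

    # A plain character value is converted to a symbol:
    #   out_col becomes the column `legs`.
    # A character value wrapped in I() is kept as a literal:
    #   target stays the string "spider".
    by_name <- animalDt[
        animal == target,
        out_col,
        env = list(out_col = "legs", target = I("spider"))
    ]
    by_name
    # [1] 8

    # Numeric values are substituted as they are, so no I() is needed.
    by_legs <- animalDt[legs == n_legs, animal, env = list(n_legs = 4)]
    by_legs
    # [1] "bear"
    ```

    Because the placeholders are chosen by you, they can always be named so that they do not clash with any column.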
    

    Alternative approach: the .. prefix

    Alternatively, in this instance, you can use the .. prefix to refer to objects in the parent environment. As the data.table vignettes note:

    For those familiar with the Unix terminal, the .. prefix should be reminiscent of the “up-one-level” command, which is analogous to what’s happening here – the .. signals to data.table to look for the select_cols variable “up-one-level”, i.e., within the global environment in this case.

    animalDt[, legs[animal == ..animal]]
    # [1] 8
    

    I think the vignette is actually a little conservative, as .. can access variables which are more than one level up if necessary. Otherwise the following would not work:

    f <- function(dt) {
        g <- function(dt) dt[, legs[animal == ..animal]]
        g(dt)
    }
    
    f(animalDt)
    # [1] 8
    

    This is not a good way to write a function (animal should be a parameter), but under the hood .. is doing get0("animal", parent.frame()). This means it can find animal in any environment that encloses the parent frame, such as the global environment.
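
    The lookup itself can be reproduced in base R without data.table. This is only a sketch of the get0() behaviour described above, not data.table's actual source code:

    ```r
    animal <- "spider"  # defined at the top level

    f <- function() {
        g <- function() {
            # parent.frame() is f's frame. `animal` is not defined there, so
            # get0() (inherits = TRUE by default) carries on into the
            # environment that encloses f's frame.
            get0("animal", envir = parent.frame())
        }
        g()
    }

    f()
    # [1] "spider"
    ```

    If animal did not exist anywhere along that chain, get0() would return NULL (its default ifnotfound value) rather than throwing an error.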

    However, note that here the filtering happens inside j: we are subsetting the legs column, i.e. making a copy at certain indices, which could be slow with a very large data.table.

    We have to do it this way because .. can only be used in j, not in i, i.e. this does not work:

    animalDt[animal == ..animal, legs]
    # Error in eval(stub[[3L]], x, enclos) : object '..animal' not found
    

    Personally, I find .. more readable, and if performance is not a large concern I would use it. However, for a more generalisable and performant approach, env is the way to go.
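
    Since the question is about functions, here is a rough sketch of both approaches wrapped in functions whose parameter deliberately has the same name as a column. The helper names legs_env and legs_dotdot are hypothetical, and the env version assumes data.table 1.15.0 or later.

    ```r
    library(data.table)

    animalDt <- data.table(
        animal = c("snail", "spider", "bear"),
        legs = c(0, 8, 4)
    )

    # env approach: the filter runs in i; I() keeps the string a literal value.
    legs_env <- function(dt, animal) {
        dt[animal == animal_var, legs, env = list(animal_var = I(animal))]
    }

    # .. approach: the filter runs inside j; ..animal is looked up in the
    # calling scope, which here is the function's own frame.
    legs_dotdot <- function(dt, animal) {
        dt[, legs[animal == ..animal]]
    }

    legs_env(animalDt, "bear")
    # [1] 4

    # Both helpers should agree.
    identical(legs_env(animalDt, "bear"), legs_dotdot(animalDt, "bear"))
    ```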

    Other approaches are retired

    It is also possible to use get(), mget() or eval() here instead (as is done in the accepted answer to the similar question), but as Friede states in the comments, these approaches have now been retired in data.table in favour of env.