Dynamically select data frame columns using $ and a character value


I have a vector of column names and I want to loop over it, extracting each column from a data.frame. For example, consider the data set mtcars and some variable names stored in a character vector cols. When I try to select a variable from mtcars using an element of cols, neither of these works:

cols <- c("mpg", "cyl", "am")
col <- cols[1]
col
# [1] "mpg"

mtcars$col
# NULL
mtcars$cols[1]
# NULL

How can I get these to return the same values as

mtcars$mpg

Furthermore, how can I loop over all the columns in cols to get their values, as attempted here?

for(x in seq_along(cols)) {
   value <- mtcars[ order(mtcars$cols[x]), ]
}

Solution

  • You can't do that kind of subsetting with $. In the source code (R/src/main/subset.c) it states:

    /*The $ subset operator.
    We need to be sure to only evaluate the first argument.
    The second will be a symbol that needs to be matched, not evaluated.
    */

    Second argument? What?! You have to realise that $, like everything else in R (including, for instance, ( , + , ^ etc.), is a function that takes arguments and is evaluated. df$V1 could be rewritten as

    `$`(df , V1)
    

    or indeed

    `$`(df , "V1")
    

    But...

    `$`(df , paste0("V1") )
    

    ...for instance, will never work, nor will anything else in the second argument that must first be evaluated. You may only pass a name or a literal string, and it is never evaluated.
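
    A minimal sketch of this behaviour, using mtcars in place of the placeholder df. The tryCatch wrapper is only there so the failing call can be inspected rather than stopping the script, and the variable names are illustrative.

    ```r
    # `$` called as an ordinary function: the second argument is taken literally
    by_name   <- `$`(mtcars, mpg)      # an unevaluated symbol
    by_string <- `$`(mtcars, "mpg")    # a literal string
    identical(by_name, by_string)      # both are the mpg column

    # A variable holding the column name is not looked up; `$` searches for a
    # column literally called "col" and finds none
    col <- "mpg"
    is.null(mtcars$col)

    # An expression in the second position is not evaluated either, so this
    # call signals an error instead of returning the column
    result <- tryCatch(`$`(mtcars, paste0("mpg")), error = function(e) e)
    inherits(result, "error")
    ```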

    Instead use [ (or [[ if you want to extract only a single column as a vector).

    For example,

    var <- "mpg"
    # Doesn't work: looks for a column literally named "var" and returns NULL
    mtcars$var
    # These both work, but note that what they return is different:
    # the first is a vector, the second is a data.frame
    mtcars[[var]]
    mtcars[var]
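
    Applied to the loop in the question, a sketch along these lines uses [[ with each name in cols. The list sorted_by is a name chosen here for illustration; it keeps one sorted copy of mtcars per column instead of overwriting a single value on every iteration.

    ```r
    cols <- c("mpg", "cyl", "am")

    # [[ evaluates its argument, so a variable holding the name works
    identical(mtcars[[cols[1]]], mtcars$mpg)

    # One sorted copy of mtcars per column name, kept in a named list
    sorted_by <- list()
    for (x in cols) {
      sorted_by[[x]] <- mtcars[order(mtcars[[x]]), ]
    }

    names(sorted_by)
    head(sorted_by[["mpg"]]$mpg)
    ```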
    

    You can perform the ordering without loops, using do.call to construct the call to order. Here is a reproducible example:

    #  set seed for reproducibility
    #  (sample() changed its default algorithm in R 3.6.0, so newer versions
    #  draw different values from those printed below)
    set.seed(123)
    df <- data.frame( col1 = sample(5, 10, replace = TRUE) , col2 = sample(5, 10, replace = TRUE) , col3 = sample(5, 10, replace = TRUE) )
    
    #  We want to sort by 'col3' then by 'col1'
    sort_list <- c("col3","col1")
    
    #  Use 'do.call' to call order. The second argument in do.call is a list of arguments
    #  to pass to the first argument, in this case 'order'.
    #  Since a data.frame is really a list, we just subset the data.frame
    #  according to the columns we want to sort by, in that order.
    #  df[ sort_list ] stays a data.frame (a list) even when only one column is named
    df[ do.call( order , df[ sort_list ] ) , ]
    
       col1 col2 col3
    10    3    5    1
    9     3    2    2
    7     3    2    3
    8     5    1    3
    6     1    5    4
    3     3    4    4
    2     4    3    4
    5     5    1    4
    1     2    5    5
    4     5    3    5
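
    As a check on what do.call builds, the sketch below uses a small hand-written data frame (not the sampled one above) and compares the dynamic call with order written out by hand. The unname() step is an addition made here, on the assumption that it is useful to keep column names from being passed to order as named arguments such as decreasing.

    ```r
    # Small fixed data frame so the result does not depend on random draws
    df <- data.frame(col1 = c(3, 1, 2, 1),
                     col2 = c(9, 8, 7, 6),
                     col3 = c(2, 2, 1, 1))
    sort_list <- c("col3", "col1")

    # do.call(order, list(a, b)) builds the same call as order(a, b)
    idx_dynamic <- do.call(order, unname(as.list(df[sort_list])))
    idx_manual  <- order(df$col3, df$col1)

    identical(idx_dynamic, idx_manual)
    df[idx_dynamic, ]
    ```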