r data.table

In what order does data.table process `.SD`, `.SDcols`, and `by`?


I'm new to data.table. In the case below, when is the content of .SDcols processed? According to the documentation, columns not listed in .SDcols are not included in .SD, and I have only listed v1 in .SDcols, so .SD should not contain value. In theory, shouldn't the filter on value raise an error? I don't understand why it works.

library(data.table)

dt <- data.table(
  group = c("A", "A", "B", "B", "B"),
  value = c(3, 6, 1, 2, 4),
  v1 = c(1,2,3,4,5)
)
dt[, .SD[value == min(value)], by = group, .SDcols = "v1"]
#>     group    v1
#>    <char> <num>
#> 1:      A     1
#> 2:      B     3

Created on 2025-06-25 with reprex v2.1.1

My guess at how this is handled:

  1. grouping is done first, based on by
  2. rows are filtered based on the information in .SD
  3. the columns listed in .SDcols are extracted

Looking forward to the clarification, thanks!


Solution

  • Let's walk through the process step by step.

    Content of .SD by group:

    dt[, by=group, .SD, .SDcols = "v1"]
    
        group    v1
       <char> <num>
    1:      A     1
    2:      A     2
    3:      B     3
    4:      B     4
    5:      B     5
    

    OK, as expected. Now let's add value.

    dt[, by=group, cbind(value, .SD), .SDcols = "v1"]
    
        group value    v1
       <char> <num> <num>
    1:      A     3     1
    2:      A     6     2
    3:      B     1     3
    4:      B     2     4
    5:      B     4     5
    

    Being able to do that means that the table's columns, as well as .SD, are available in the j scope. Let's add the filter condition.

    dt[, by=group, cbind(filter=value==min(value), .SD), .SDcols = "v1"]
    
        group filter    v1
       <char> <lgcl> <num>
    1:      A   TRUE     1
    2:      A  FALSE     2
    3:      B   TRUE     3
    4:      B  FALSE     4
    5:      B  FALSE     5
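
    As a quick check that value is read from the j scope rather than from .SD, the sketch below asks each group which columns .SD holds and how many value entries are visible. The output column names sd_cols and n_value are made up for this illustration.

    ```r
    library(data.table)

    dt <- data.table(
      group = c("A", "A", "B", "B", "B"),
      value = c(3, 6, 1, 2, 4),
      v1 = c(1, 2, 3, 4, 5)
    )

    # Per group: the column names inside .SD, and the length of `value` as seen from j.
    res <- dt[, .(sd_cols = paste(names(.SD), collapse = ","),
                  n_value = length(value)),
              by = group, .SDcols = "v1"]
    res
    ```

    With this data, sd_cols is "v1" for both groups, while n_value is 2 for A and 3 for B, so value is visible in j even though it is not part of .SD.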
    

    Pretty easy to see what's going to happen now :-)

    dt[, .SD[value == min(value)], by = group, .SDcols = "v1"]
    
        group    v1
       <char> <num>
    1:      A     1
    2:      B     3
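
    Put differently, the filter seems to behave like a two-step computation: the condition is evaluated against the group's columns, and the resulting logical vector then picks rows of .SD. A minimal sketch of that reading, where keep is just a helper name chosen here:

    ```r
    library(data.table)

    dt <- data.table(
      group = c("A", "A", "B", "B", "B"),
      value = c(3, 6, 1, 2, 4),
      v1 = c(1, 2, 3, 4, 5)
    )

    two_step <- dt[, {
      keep <- value == min(value)  # logical vector computed from the group's `value`
      .SD[keep]                    # row subset of .SD, which only holds v1
    }, by = group, .SDcols = "v1"]

    one_step <- dt[, .SD[value == min(value)], by = group, .SDcols = "v1"]

    # Compare the two results
    all.equal(two_step, one_step)
    ```

    Both calls should return the same two rows, which suggests value is never looked up inside .SD itself.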
    

    So it's more like:

    1. grouping is done first, based on by

    2. .SD is built from the current group's rows, keeping only the .SDcols columns, and "added" to the j scope alongside the group's other columns
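
    The steps above can be imitated by hand. This is only an illustration of the order of operations, not how data.table is implemented internally; rows, sd_g, keep, and manual are names invented for the sketch.

    ```r
    library(data.table)

    dt <- data.table(
      group = c("A", "A", "B", "B", "B"),
      value = c(3, 6, 1, 2, 4),
      v1 = c(1, 2, 3, 4, 5)
    )

    # Hand-rolled imitation of the grouped call, one group at a time.
    manual <- rbindlist(lapply(unique(dt$group), function(g) {
      rows <- dt[group == g]                  # 1. rows of the current group
      sd_g <- rows[, .(v1)]                   # 2. stand-in for .SD: only the .SDcols columns
      keep <- rows$value == min(rows$value)   #    `value` comes from the group, not from sd_g
      data.table(group = g, sd_g[keep])       # 3. filtered rows, prefixed with the group key
    }))

    actual <- dt[, .SD[value == min(value)], by = group, .SDcols = "v1"]

    # Compare the hand-rolled result with the real grouped call
    all.equal(manual, actual)
    ```

    If the two results agree, that supports the reading that the filter condition is evaluated on the group's full set of columns, while only the .SDcols columns end up in the output.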