rdata.tablesubset

Complex subsetting in R for case-control study


I am trying to perform a complex subsetting of data in R. I work in epidemiology on cohort studies and I need to run some simulations to show performances of nested case-control designs.

In order to do this 1) I need to create a full time-to-event dataset. That is, a data frame in which each row corresponds to a subject(ID)-period. In my example (data), the first two rows correspond to ID 1 for period 0-1 and 1-2. In the last period, each ID can either experience the event (Event=1) or not (Event=0). If in a specific period, the ID experiences the event it is called a case, otherwise a control. For each period I have data on a specific exposure, plus additional time-invariant information, in this case the sex of the subject. This is the form of an extended survival dataset with time-varying exposure/treatment. In this case the dataset is already provided: data, but I can post the simulation code if anyone is interested.

At a second stage, 2) I need to define a nested case-control dataset. This can performed by selecting, for each case, a sample of controls (the size is the variable set) matched by period and other variables if needed. To match by period it is sufficient to match by the "Stop" column, that is, the end of the period. In addition, in my example I also match by "sex". In general, it may happen that there are multiple cases for each period. In the example, for "Stop"=2 and "sex"=0, there are two cases! Therefore, I will need to sample twice the size of controls.

As follows, I show a simulated time-to-event dataset in subject-period format (as data.table) and my solution to define a nested case-control dataset (ncc).

Current Attempt

library(data.table)

#Convert data into data.table 
setDT(data)

## Define relevant information for risk sets
#1) Matching variables: "Stop" and "sex"
#2) Count number of cases per risk set
infocase <- data[Event==1,.(count = .N), by=c("Stop","sex")][
  order(Stop,sex)] # ordering is just convenient

# Define risk set size
set<-2

# Loop trhough each case-Stop-sex triples 
ncc <- lapply(seq(nrow(case)), function(i) {

  x <- infocase[i,]
  
  # Sample controls based on "Stop" and "sex" value
  controls<-data[Event==0 & Stop==x$Stop & sex==x$sex][sample(.N, min(set*x$count,.N))]
  
  # Create full risk set with cases and controls. Give it an identifier
  fullset<-rbind(controls,data[Event==1 & Stop==x$Stop & sex==x$sex])
  fullset$riskset<-i
  fullset
})

# Bind all risk sets together
ncc <- rbindlist(ncc)

My question is: the data I work with may consist of several millions rows and therefore a lapply (or a for loop) might not be the best way to approach the problem. I am wondering if someone can help define a more efficient and memory-saving solution :)


Data

data <- structure(list(Id = c(1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6),
                       Event = c(0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1), 
                       Start = c(0, 1, 0, 1, 2, 0, 0, 1, 0, 1, 2, 0, 1), 
                       Stop = c(1, 2, 1, 2, 3, 1, 1, 2, 1, 2, 3, 1, 2), 
                       sex = c(1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0),
                       exposure = c(10.89, 6.54, 14.58, 11.82, 9.5, 9.53, 13.9, 9.54, 7.48, 5.4, 3.8, 12.78, 10.97)),
                  row.names = c(1L, 2L, 6L, 7L, 8L, 11L, 16L,  17L, 21L, 22L, 23L, 26L, 27L, 31L),  
class = "data.frame") 

Solution

  • A non-equi join will be more performant. The trick is to shuffle the Event == 0 rows first (so the sample will be random), then use rowid (divided by set) so we can non-equi join on count.

    Updated with the new example data.

    case <- data[Event==1, .(count = .N), keyby = c("Stop","sex")]
    set <- 2
    
    setorder(
      rbindlist(
        list(
          data[Event == 0][sample(.N)][,i := rowid(Stop, sex)/set][
            case, on = .(Stop = Stop, sex = sex, i <= count), nomatch = 0
          ][,i := NULL],
          data[Event == 1]
        )
      ),
      Stop, sex, Event, Id
    )[,riskset := rleid(Stop, sex)][]
    #>       Id Event Start  Stop   sex exposure riskset
    #>    <num> <num> <num> <num> <num>    <num>   <int>
    #> 1:     1     0     0     1     1    10.89       1
    #> 2:     5     0     0     1     1     7.48       1
    #> 3:     3     1     0     1     1     9.53       1
    #> 4:     2     0     1     2     0    11.82       2
    #> 5:     4     1     1     2     0     9.54       2
    #> 6:     6     1     1     2     0    10.97       2
    #> 7:     5     0     1     2     1     5.40       3
    #> 8:     1     1     1     2     1     6.54       3
    #> 9:     5     1     2     3     1     3.80       4