r, data.table, tidyverse, runner

How can I rewrite code that applies a function to a subset of rows, based on another vector, in different R ecosystems?


In my problem I have to apply a function to a subset of each individual time series, based on a set of dates extracted from the original data. So, I have a data.frame with a time series for each individual between 2005-01-01 and 2010-12-31 (test_final_ind_series) and a sample of individual-date pairs (sample_events), ideally extracted from the same data.

With these, in my example I attempt to calculate an average over a subset of the time-series values exp, conditional on the individual and date in sample_events.

I did this in 2 different ways:

1: a simple but effective piece of code that gets the job done very quickly. I simply ask the user to input the data for a specific individual and to define a lag and a window width (like a rolling average). The function exp_summary then outputs the requested average.

To repeat the operation for each row in sample_events, I decided to nest the individual series by the individuals' id and then attach the sample of dates. Finally, I just run a loop that applies the function to each individual's nested dataframe.

#Packages used below
library(data.table)
library(dplyr)
library(tidyr)
library(runner)

#Sample data
set.seed(111)
exp_series <- data.frame(
  id = as.character(rep(1:10000, each=2191)), 
  date = rep(seq(as.Date('2005-01-01'),
                 as.Date('2010-12-31'), by = 'day'),times=10000),
  exp = rep(rnorm(n=10000, mean=10, sd=5),times=2191)
)


sample_dates <- data.frame(
  Event_id = as.character(replicate(10000,sample(1:10000,size = 1,replace = TRUE))), 
  Event_date = sample(
    seq(as.Date('2005-01-01'),
        as.Date('2010-12-31'), by = 'day'),
    size =10000,replace = TRUE)
)



#Given a dataframe with dates and an exposure series (df),
#an event date, a lag value, and a width of the window,
#this function outputs the average exposure over the user-defined time window
exp_summary <- function(df, event_date, lag = 0, width = 0){
    df <- as.data.table(df)
    end <- as.character(as.Date(event_date) - lag)
    # Clamp the start at the earliest available date, in case the
    # time window goes beyond the time limits of the series
    start <- as.character(max(as.Date(end) - width, min(df$date)))
    return(mean(df[date %between% c(start, end)]$exp))
}
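
For instance, a single call might look like the sketch below; the id "1" and the event date are arbitrary choices used only for illustration:

#Illustrative call: 365-day average of exp ending 1 day before 2006-06-15 for individual "1"
exp_summary(exp_series[exp_series$id == "1", ], event_date = "2006-06-15", lag = 1, width = 365)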

#Nest dataframes
exp_series_nest <- exp_series %>% 
  group_by(id) %>% 
  nest()


#Merge with sample events, including only the necessary dates
full_data<-merge(exp_series_nest,sample_dates, by.x="id", by.y="Event_id",all.x = FALSE, all.y=TRUE)


#Initialize dataframe in advance
summaries1<-setNames(data.frame(matrix(ncol = 2, nrow = nrow(full_data))), c("id", "mean"))
summaries1$id<-full_data$id

#Loop over each row of full_data; each row carries one id's nested data.frame
system.time(for (i in 1:nrow(full_data)){
  summaries1$mean[i]<-exp_summary(full_data$data[[i]], full_data$Event_date[i], lag=1, width=365)
})

2: using the highly flexible package runner

With the same data, I need to specify the arguments properly: the series to roll over (x), its date index (idx), the window width (k), the lag, the event dates at which to evaluate (at), and the summary function (f). I have also opened an issue on the GitHub repository to ask about speeding up this code with parallelization.

system.time(summaries2 <- sample_dates %>%
  group_by(Event_id) %>%
  mutate(
    mean = runner(
      x = exp_series[exp_series$id ==  Event_id[1],], 
      k = "365 days", 
      lag = "1 days",
      idx =exp_series$date[exp_series$id == Event_id[1]],
      at = Event_date,
      f = function(x) {mean(x$exp)},
      na_pad=FALSE
    )
  )    
)

They give the same results up to the second decimal place, but method 1 is much faster than method 2, and you can see the difference when you use very large datasets.

My question is: for method 1, how can I write the last loop in a more concise way within the data.table and/or tidyverse ecosystems? I really struggle to make nested lists and "normal" columns embedded in the same dataframe work together.

Also, if you have any other recommendations, I am open to hearing them! I am here more out of curiosity than need, as my problem is already solved acceptably by method 1.


Solution

  • With data.table, you could join exp_series with the date range you want in sample_dates and calculate the mean with by=.EACHI:

    library(data.table)
    
    setDT(exp_series)
    setDT(sample_dates)
    
    
    lag <- 1
    width <- 365 
    # Define range
    sample_dates[,':='(begin=Event_date-width-lag,end=Event_date-lag)]
    
    # Calculate mean by .EACHI
    summariesDT <- exp_series[sample_dates,.(id,mean=mean(exp))
                                          ,on=.(id=Event_id,date>=begin,date<=end),by=.EACHI][
                                          ,.(id,mean)]
    

    Note that this returns the same results as summaries1 only for Event_id values that are not duplicated in sample_dates (a quick check is sketched at the end of this answer).

    The results are different in case of duplicates, for instance Event_id==1002:

    sample_dates[Event_id==1002]
       Event_id Event_date      begin        end
         <char>     <Date>     <Date>     <Date>
    1:     1002 2010-08-17 2009-08-16 2010-08-16
    2:     1002 2010-06-23 2009-06-22 2010-06-22
    

    If you don't have duplicates in your real data, this shouldn't be a problem.
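
    As a quick sanity check of the claim above (assuming summaries1 from the question is still in your session), you could drop the Event_id values that appear more than once in sample_dates and compare the two sets of means; this is only a sketch of the comparison, not part of the solution itself:

    dup_ids <- sample_dates[duplicated(Event_id) | duplicated(Event_id, fromLast = TRUE),
                            unique(Event_id)]
    check <- merge(summariesDT[!id %in% dup_ids],
                   as.data.table(summaries1)[!id %in% dup_ids],
                   by = "id")
    check[, all.equal(mean.x, mean.y)]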