in my problem I have to apply a function on a subset of individual time-series based on a set of dates extracted from the original data.
So, I have a data.frame with a time-series for each individual between 2005-01-01 and 2010-12-31 (test_final_ind_series
) and a sample of pairs individual-date (sample_events
) ideally extracted from the same data.
With these, in my example I attempt to calculate an average on a subset of the time-series values exp
conditional on individual and date in the sample_events
.
I did this in 2 different ways:
1: a simple but effective code that gets the job done very quickly
I simply ask the user to input the data for a specific individual and define a lag of time and a window width (like a rolling average). The function exp_summary
then outputs the requested average.
To repeat the operation for each row in sample_events
I decided to nest the individual series by ID of the individuals and then attach the sample of dates. Eventually, I just run a loop that applies the function to each individual nested dataframe.
#Sample data
set.seed(111)
exp_series <- data.frame(
id = as.character(rep(1:10000, each=2191)),
date = rep(seq(as.Date('2005-01-01'),
as.Date('2010-12-31'), by = 'day'),times=10000),
exp = rep(rnorm(n=10000, mean=10, sd=5),times=2191)
)
sample_dates <- data.frame(
Event_id = as.character(replicate(10000,sample(1:10000,size = 1,replace = TRUE))),
Event_date = sample(
seq(as.Date('2005-01-01'),
as.Date('2010-12-31'), by = 'day'),
size =10000,replace = TRUE)
)
#This function, given a dataframe with dates and exposure series (df)
#an event_date
#a lag value
#a width of the window
#Outputs the average for a user-defined time window
exp_summary<- function(df, event_date, lag=0,width=0){
df<-as.data.table(df)
end<-as.character(as.Date(event_date)-lag)
start<-as.character(max(as.Date(end)-width, min(df$date)))# I need this in case the time window goes beyond the time limits (earliest date)
return(mean(df[date %between% c(start,end)]$exp))
}
#Nest dataframes
exp_series_nest <- exp_series %>%
group_by(id) %>%
nest()
#Merge with sample events, including only the necessary dates
full_data<-merge(exp_series_nest,sample_dates, by.x="id", by.y="Event_id",all.x = FALSE, all.y=TRUE)
#Initialize dataframe in advance
summaries1<-setNames(data.frame(matrix(ncol = 2, nrow = nrow(full_data))), c("id", "mean"))
summaries1$id<-full_data$id
#Loop over each id, which is nasted data.frame
system.time(for (i in 1:nrow(full_data)){
summaries1$mean[i]<-exp_summary(full_data$data[[i]], full_data$Event_date[i], lag=1, width=365)
})
2: using the highly-flexible package runner
With the same data I need to properly specify the arguments properly. I have also opened an issue on the Github repository to speed-up this code with parallelization.
system.time(summaries2 <- sample_dates %>%
group_by(Event_id) %>%
mutate(
mean = runner(
x = exp_series[exp_series$id == Event_id[1],],
k = "365 days",
lag = "1 days",
idx =exp_series$date[exp_series$id == Event_id[1]],
at = Event_date,
f = function(x) {mean(x$exp)},
na_pad=FALSE
)
)
)
They give very same results up to the second decimal, but method 1 is much faster than 2, and you can see the difference when you use very datasets.
My question is, for method 1, how can I write the last loop in a more concise way within the data.table and/or tidyverse ecosystems? I really struggle in making work together nested lists and "normal" columns embedded in the same dataframe.
Also, if you have any other recommendation I am open to hear it! I am here more for curiosity than need, as my problem is solved by method 1 already acceptably.
With data.table
, you could join exp_series
with the range you wish in sample_dates
and calculate mean by=.EACHI
:
library(data.table)
setDT(exp_series)
setDT(sample_dates)
lag <- 1
width <- 365
# Define range
sample_dates[,':='(begin=Event_date-width-lag,end=Event_date-lag)]
# Calculate mean by .EACHI
summariesDT <- exp_series[sample_dates,.(id,mean=mean(exp))
,on=.(id=Event_id,date>=begin,date<=end),by=.EACHI][
,.(id,mean)]
Note that this returns the same results as summaries1
only for Event_id
without duplicates in sample_dates
.
The results are different in case of duplicates, for instance Event_id==1002
:
sample_dates[Event_id==1002]
Event_id Event_date begin end
<char> <Date> <Date> <Date>
1: 1002 2010-08-17 2009-08-16 2010-08-16
2: 1002 2010-06-23 2009-06-22 2010-06-22
If you don't have duplicates in your real data, this shouldn't be a problem.