Tags: r, performance, bigdata, processing-efficiency

Computationally faster alternatives to calculate a new variable based on multiple columns from two large data frames in R


I have two datasets, df1 and df2.

(1) df1 has about 7 million rows and 57 columns. The columns of interest in df1 for this calculation are "county", "start.date", and "end.date".

(2) df2 has about 12 million rows and 12 columns. The columns of interest in df2 for this calculation are "county", "date", and "measure".

For each row i in df1, I need to calculate the mean of all df2$measure[j] that satisfy two conditions:

(a) df2$date[j] is on or after df1$start.date[i] and strictly before df1$end.date[i] (i.e., date >= start.date and date < end.date, matching the filter in my code), and

(b) df2$county[j] matches df1$county[i]

This mean.measure will be added as a new column to df1 for further analysis.
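
For illustration, here is a hypothetical toy version of the two datasets (names and values are made up purely to show the expected result):

toy_df1 <- data.frame(
  county = c("A", "B"),
  start.date = as.Date(c("2020-01-01", "2020-01-01")),
  end.date = as.Date(c("2020-01-03", "2020-01-02"))
)
toy_df2 <- data.frame(
  county = c("A", "A", "A", "B"),
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-05", "2020-01-01")),
  measure = c(1, 3, 100, 7)
)
# Expected mean.measure for toy_df1: c(2, 7)
# row 1: 2020-01-01 and 2020-01-02 fall in county A's window, so mean(c(1, 3)) = 2
# row 2: only 2020-01-01 falls in county B's window, so 7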

My major problem is computational speed.

My code worked on smaller subsets of the data, but at full scale it took far too long. My parallel computing session (8 CPUs, 8 GB of memory) was limited to 8 hours, and even after the full 8 hours it had not finished.

I would truly appreciate suggestions on how to make this faster, besides switching to a more powerful system. Thank you so much in advance!

Below is my code:

library(dplyr)

# Create function to extract mean measure
# x = county; y = start date; z = end date
mean.measure <- function(x, y, z) {
  if (is.na(x) || is.na(y) || is.na(z)) return(NA)
  if (y >= z) return("Start date >= end date")
  if (!(x %in% df2$county)) return("Nonexistent county")
  if (z > max(df2$date) || y < min(df2$date)) return("Out of df2 date range")
  temp <- df2 %>%
    filter(county == x & date >= y & date < z) %>%  # keep only the county and date range of interest
    summarise(mean = mean(measure, na.rm = TRUE))   # ignore NA values for measure
  return(temp$mean)
}


# Vectorize mean.measure()
v_mean.measure <- Vectorize(mean.measure) # Vectorize() wraps mapply(), so this still loops row by row in R

# Calculate mean measure and add to df1
df1 <- df1 %>%
  mutate(mean.measure = v_mean.measure(county, start.date, end.date))

Solution

  • Start by sorting df2 by county and then date, and add an index column (r in the code below) and a column with the cumulative sum of measure (cs below). Two fast data.table non-equi joins can then find, for each row of df1, the first and last rows of df2 (r1 and r2 below) whose county matches and whose date falls in that row's window. The sum of measure over rows r1 through r2 can be read off the cumulative sums, and dividing by the number of rows in the window gives the mean; a tiny illustration of the cumulative-sum trick follows.
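
    To see the cumulative-sum trick in isolation, here is a minimal sketch with made-up numbers (not part of the benchmark below): the mean over any contiguous run of sorted rows can be recovered from cs and the two row indices alone, without revisiting the rows themselves.

    measure <- c(2, 4, 6, 8)
    cs <- cumsum(measure)                      # 2, 6, 12, 20
    # mean of rows 2..3 from the cumulative sums alone:
    (cs[3] - cs[2] + measure[2]) / (3 - 2 + 1) # (12 - 6 + 4) / 2 = 5
    mean(measure[2:3])                         # 5, same result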

    Using some dummy data with 7M rows in df1 and 12M rows in df2, the computation takes only a few seconds.

    library(data.table)
    
    system.time(
      setDT(df1)[
        , `:=`(
          r1 = setorder(
            setDT(df2),
            county, date
          )[
        , `:=`(r = .I, cs = cumsum(measure)) # row index and running sum over the sorted table
          ][
            df1,
            r,
            on = .(county == county, date >= start.date),
            mult = "first"
          ],
          r2 = df2[
            df1,
            r,
            on = .(county == county, date < end.date),
            mult = "last"
          ]
        )
      ][
        , `:=`(
          mean.measure = (df2$cs[r2] - df2$cs[r1] + df2$measure[r1])/(r2 - r1 + 1L), # sum of measure over rows r1..r2 inclusive, divided by the row count
          r1 = NULL,
          r2 = NULL
        )
      ]
    )
    #>    user  system elapsed 
    #>    5.28    0.31    2.66
    

    Data:

    df1 <- data.frame(
      county = sample(LETTERS[1:25], 7e6, replace = TRUE),
      start.date = sample(seq.Date(as.Date("2010-01-01"), as.Date("2020-12-31"), "days"), 7e6, replace = TRUE)
    )
    df1$end.date <- df1$start.date + sample(1e3, 7e6, replace = TRUE)
    df2 <- data.frame(
      county = sample(LETTERS, 12e6, replace = TRUE),
      date = sample(seq.Date(as.Date("2010-06-01"), as.Date("2020-07-31"), "days"), 12e6, replace = TRUE),
      measure = runif(12e6)
    )
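
    As an optional sanity check (not part of the original answer), one can recompute the mean directly for a handful of rows after running the code above and compare; idx below is just an illustrative helper:

    # sample a few df1 rows whose windows overlap df2's date range
    idx <- sample(which(!is.na(df1$mean.measure)), 5)
    for (i in idx) {
      direct <- mean(df2$measure[df2$county == df1$county[i] &
                                 df2$date >= df1$start.date[i] &
                                 df2$date < df1$end.date[i]])
      print(all.equal(direct, df1$mean.measure[i]))
    }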