r data.table large-data non-linear-regression loess

LOESS on very large dataset


I'm working with a very large dataset containing CWD (Cumulative Water Deficit) and EVI (Enhanced Vegetation Index) measurements across different landcover types. The current code uses LOESS regression to model the relationship between these variables, but it is extremely slow: after more than 5 days the run has still not completed.

Here's a snippet of my current approach:

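# Fit LOESS for one landcover / return-period group; return NULL on failure
Loess_model <- tryCatch({
  loess(EVI ~ cwd, data = filtered_data, span = 0.5)
}, error = function(e) {
  message("LOESS fitting failed for landcover: ", landcover_val,
          " rp_group: ", rp_group_val)
  # conditionMessage() extracts the error text; pasting `e` itself is unreliable
  message("Error: ", conditionMessage(e))
  NULL
})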

I'm processing multiple landcover-return period groups in parallel (using the future package), but even with parallelization, the computational time is prohibitive. Some of my datasets contain over 1 000 000 observations for a single group.

I've already:

What alternatives would you recommend for:

My main goal, as shown in this example [image not included], is to identify thresholds across different landcover types and drought return periods, so I need a smoothing approach that can capture the non-linear relationship effectively.

Has anyone tackled a similar problem or can recommend alternatives to standard LOESS that would be more computationally efficient?


Solution

  • Use lowess, which is built into R, or Hmisc::movStats. movStats uses data.table to compute smooth estimates efficiently over overlapping moving windows of x. Here is an example with timings.

    Side comment: Smoothing is better at showing that thresholds don't exist than it is for finding useful thresholds, since to be useful, relationships need to be flat on both sides of the threshold, which doesn't occur in nature very often.

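    library(Hmisc)     # movStats
    library(ggplot2)
    
    # Simulate one million observations with a non-linear trend
    set.seed(1)
    n <- 1000000
    x <- runif(n)
    y <- x ^ 2 + runif(n)
    system.time(f <- lowess(x, y))   # 1.3s
    
    # Moving-window statistics of y over x, in long format for plotting
    system.time(m <- movStats(y ~ x, melt=TRUE))   # 0.4s
    ggplot(m, aes(x=x, y=y, color=Statistic)) + geom_line()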
    

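To connect this to the grouped setup in the question, here is a minimal sketch of running lowess once per landcover / return-period group in base R. The column names (EVI, cwd, landcover, rp_group) follow the question, but the simulated data, the helper smooth_group and the grid size are assumptions made purely for illustration. Note that f = 0.5 only plays roughly the role of span = 0.5: lowess fits local lines with robustness iterations, while loess defaults to local quadratics, so the curves will not match exactly.

```r
# Simulated stand-in for the real data (values are made up for illustration)
set.seed(1)
n <- 100000
d <- data.frame(
  landcover = sample(c("forest", "grassland"), n, replace = TRUE),
  rp_group  = sample(c("rp5", "rp20"), n, replace = TRUE),
  cwd       = runif(n, -500, 0)
)
d$EVI <- 0.5 + 0.0005 * d$cwd + rnorm(n, sd = 0.05)

# Hypothetical helper: smooth one group and return the curve on a coarse grid.
# delta (shown here at its default value) lets lowess interpolate between
# nearby x values instead of refitting at every point, which keeps it fast.
smooth_group <- function(g, n_grid = 200) {
  fit  <- lowess(g$cwd, g$EVI, f = 0.5, delta = 0.01 * diff(range(g$cwd)))
  grid <- seq(min(g$cwd), max(g$cwd), length.out = n_grid)
  data.frame(
    landcover = g$landcover[1],
    rp_group  = g$rp_group[1],
    cwd       = grid,
    EVI_hat   = approx(fit$x, fit$y, xout = grid, ties = mean, rule = 2)$y
  )
}

# One data frame per landcover x rp_group combination, then one smooth each
groups  <- split(d, list(d$landcover, d$rp_group), drop = TRUE)
smooths <- do.call(rbind, lapply(groups, smooth_group))
```

Returning the smooth on a small grid rather than at every observation keeps the result compact enough to plot or inspect per group. If the groups are already being processed with the future package, the lapply call could probably be swapped for future.apply::future_lapply, though with lowess this fast the parallel overhead may not pay off.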