Tags: r, xts, zoo

Delimit precipitation events from xts objects using `split.zoo`


I'd like to split an xts object of precipitation depths into chunks, in order to delimit individual events from a continuous record of observations for subsequent analysis.

Let's assume I'm working with this data:

datetime <- seq(as.POSIXct("2020-01-01 00:00", tz = "UTC"), 
                by = "1 min", 
                length.out = 1440)

vals <- rep(0, length(datetime))

x <- xts::xts(vals, order.by = datetime)


# fill xts object with some random values
# event #1
zoo::coredata(x["2020-01-01 00:30/2020-01-01 02:35"]) <- runif(126, min = 0.01, max = 0.2)

# event #2
zoo::coredata(x["2020-01-01 08:45/2020-01-01 12:50"]) <- runif(246, min = 0.01, max = 0.2)

# event #3
zoo::coredata(x["2020-01-01 17:15/2020-01-01 17:30"]) <- runif(16, min = 0.01, max = 0.2)
zoo::coredata(x["2020-01-01 18:15/2020-01-01 19:00"]) <- runif(46, min = 0.01, max = 0.2)
zoo::coredata(x["2020-01-01 22:30/2020-01-01 23:00"]) <- runif(31, min = 0.01, max = 0.2)

In order to delimit events, I'd like them to meet the following criterion:

An event starts with the first value > 0 and ends with the last value > 0, provided no further precipitation is recorded during the following 4 hours.

From my research, I'd need a factor of length length(x) with levels corresponding to the "event id" to use as input for split.zoo. Event #1 would be characterized by level = 1, #2 by 2, and so on. I'm not interested in the precipitation breaks per se, so they could simply be mapped to 0 (or even dismissed at this point).
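
For reference, here is a minimal illustration of how split() consumes such a grouping vector (z and g_toy are throwaway objects, not my actual data):

z <- zoo::zoo(c(0, 0.2, 0.1, 0, 0, 0.3), 1:6)
g_toy <- c(0, 1, 1, 0, 0, 2)
split(z, g_toy)   # a list of three zoo series, one per level of g_toy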

My expected result would be a list of zoo/xts objects encompassing the individual events:

# looking for `g` in the end, making use of some efficient rolling approach
split(x, g) |> str()
#> List of 4
#>  $ 0:'zoo' series from 2020-01-01 to 2020-01-01 23:59:00
#>   Data: num [1:722, 1] 0 0 0 0 0 0 0 0 0 0 ...
#>   Index:  POSIXct[1:722], format: "2020-01-01 00:00:00" "2020-01-01 00:01:00" ...
#>  $ 1:'zoo' series from 2020-01-01 00:30:00 to 2020-01-01 02:35:00
#>   Data: num [1:126, 1] 0.1737 0.0958 0.0491 0.1861 0.1877 ...
#>   Index:  POSIXct[1:126], format: "2020-01-01 00:30:00" "2020-01-01 00:31:00" ...
#>  $ 2:'zoo' series from 2020-01-01 08:45:00 to 2020-01-01 12:50:00
#>   Data: num [1:246, 1] 0.1136 0.1473 0.0433 0.1311 0.1741 ...
#>   Index:  POSIXct[1:246], format: "2020-01-01 08:45:00" "2020-01-01 08:46:00" ...
#>  $ 3:'zoo' series from 2020-01-01 17:15:00 to 2020-01-01 23:00:00
#>   Data: num [1:346, 1] 0.1614 0.0632 0.1216 0.1888 0.0967 ...
#>   Index:  POSIXct[1:346], format: "2020-01-01 17:15:00" "2020-01-01 17:16:00" ...

Since my real-world minutely data spans decades, I'd prefer a fast approach. Ideally the solution could also be applied to data of other temporal resolutions, i.e. it should work with 5-minutely or hourly data as well.
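
The 4-hour threshold could presumably be expressed as a number of observations derived from the data's resolution, along these lines (a sketch; step_sec and n_gap are illustrative names, and a regularly spaced index is assumed):

step_sec <- median(diff(as.numeric(zoo::index(x))))  # 60 for the 1-minute example data
n_gap <- 4 * 60 * 60 / step_sec                      # 240 observations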


Solution

  1) Create a function that trims 0's off the ends, for later use. Apply it to the input x, giving xt; then compute runs of 0's and non-0's and, from those, a grouping vector g to split by. Perform the split, trim the components and insert the 0 component at the beginning.

    # rm 0's from ends of x
    trim0 <- function(x) {
      wx <- which(x != 0)
      if (length(wx) == 0) x[wx] else x[seq(min(wx), max(wx))]
    }
    
    xt <- trim0(x)
    # runs of zero / non-zero values
    r <- rle(as.vector(xt) == 0)
    four_hrs <- 4 * 60 * 60 / deltat(x) # 240 observations for 1-minute data
    # number the events: each zero run of at least 4 hours starts a new group
    r$values <- cumsum(r$values & r$lengths >= four_hrs)
    g <- inverse.rle(r) + 1
    # split into events, trim zeros off each component and prepend the full series as component "0"
    L <- c(list("0" = x), lapply(split(xt, g), trim0))
    
    str(L)
    

    giving

    List of 4
     $ 0:An xts object on 2020-01-01 / 2020-01-01 23:59:00 containing: 
      Data:    double [1440, 1]
      Index:   POSIXct,POSIXt [1440] (TZ: "UTC")
     $ 1:‘zoo’ series from 2020-01-01 00:30:00 to 2020-01-01 02:35:00
      Data: num [1:126] 0.0646 0.1598 0.0877 0.1778 0.1887 ...
      Index:  POSIXct[1:126], format: "2020-01-01 00:30:00" "2020-01-01 00:31:00" ...
     $ 2:‘zoo’ series from 2020-01-01 08:45:00 to 2020-01-01 12:50:00
      Data: num [1:246] 0.0393 0.0273 0.037 0.1411 0.1277 ...
      Index:  POSIXct[1:246], format: "2020-01-01 08:45:00" "2020-01-01 08:46:00" ...
     $ 3:‘zoo’ series from 2020-01-01 17:15:00 to 2020-01-01 23:00:00
      Data: num [1:346] 0.1746 0.0965 0.1114 0.1931 0.1572 ...
      Index:  POSIXct[1:346], format: "2020-01-01 17:15:00" "2020-01-01 17:16:00" ...
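
    As a small usage sketch for the subsequent per-event analysis mentioned in the question (not part of the answer's code), the event components (everything except "0") can be summarised directly:

    # number of observations and total depth per delimited event
    sapply(L[-1], NROW)
    sapply(L[-1], function(e) sum(zoo::coredata(e)))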
    

    2) This was my original attempt, but I ultimately found that (1) is a bit simpler. We can use na.trim, rle and split. First, define functions to convert from 0 to NA and from NA to 0 for later use. Then convert all zeroes to NA's and use na.trim to trim the NA's off the ends. Next, find runs of at least 240 NA's (change 240 if the data are not minutely), take the cumulative sum and convert back from runs to data to get the grouping vector g to split by. After splitting, trim the NA's in each component and convert back from NA's to zeroes. Finally, insert the original data as component 0.

    library(xts)
    
    zero2na <- function(x) replace(x, x == 0, NA)
    na2zero <- function(x) replace(x, is.na(x), 0)
    
    x_na <- x |> zero2na() |> na.trim()
    r <- rle(is.na(as.vector(x_na)))
    r$values <- cumsum(r$values & r$lengths >= 240)
    g <- inverse.rle(r) + 1
    L <- lapply(split(x_na, g), \(x) x |> na.trim() |> na2zero())
    L <- c(list("0" = x), L)
    
    # check
    str(L)
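
    As a quick sanity check (a sketch, not part of the original code), the components from this variant should match those of (1):

    sapply(L, NROW)                                  # 1440 126 246 346, as in approach (1)
    lapply(L[-1], function(e) range(zoo::index(e)))  # start/end time of each event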