
Identifying start and end time and duration in minutes of highlighted area in ggplot


I have highlighted areas in a time series plot where variable 1 is greater than 28.3m and in a separate plot where variable 2 is between 335 and 390 and then stacked these plots on top of each other. I want to identify the start and end times, and the duration in minutes of each of the highlighted areas, and put that information into a dataframe. Is there a way to do this in R?

Code for highlighting areas on the ggplots

# Load packages
library(xts)
library(ggplot2)
library(gridExtra)

# Create time series data every 15 minutes for 2 different variables
start_time <- as.POSIXct("2024-05-01 00:00:00")
end_time <- as.POSIXct("2024-05-02 00:00:00")
time_seq <- seq(from = start_time, to = end_time, by = "15 min")

# Variable 1
variable1 <- rnorm(length(time_seq), mean = 25, sd = 5)
variable1
str(variable1)

# Variable 2
variable2 <- rnorm(length(time_seq), mean = 350, sd = 20)

# Create dataframe of the 2 variables
df <- data.frame(DateTime = time_seq, Variable1 = variable1, Variable2 = variable2)
head(df)
str(df)

# Highlight areas on the plot where Variable 1 is greater than 28.3

plot_var1 <- ggplot(df, aes(x = DateTime, y = Variable1))+
  geom_line(color = "blue") +
  geom_rect(data = subset(df, Variable1 > 28.3),
            aes(xmin = DateTime-450, xmax = DateTime+450, ymin = -Inf, ymax = Inf),
            fill = "lightblue", alpha = 0.3) +
  labs(x = "Time", y = "Variable 1")
plot_var1

# Highlight areas on the plot where Variable 2 is between 335 and 390
plot_var2 <- ggplot(df, aes(x = DateTime, y = Variable2)) +
  geom_line(color = "red") +
  geom_rect(data = subset(df, Variable2 > 335 & Variable2 < 390),
            aes(xmin = DateTime-450, xmax = DateTime+450, ymin = -Inf, ymax = Inf),
            fill = "lightpink", alpha = 0.3) +
  labs(x = "Time", y = "Variable 2") +
  theme_minimal()
plot_var2

# Arrange plots in one column and align by x-axis
Comb_Var1_Var2_Highlight_Plot<-grid.arrange(plot_var1, plot_var2, ncol = 1)

Solution

  • You can find the length of each run of consecutive rows where a condition is true with rle(). The cumulative sum of those run lengths gives the row index at which each run ends, and subtracting the run length and adding 1 gives the row index at which it starts. Here's a function that takes a dataframe and a condition (as a string) and returns the start, end and duration of each run. Note that a run consisting of a single row gets a duration of 0, because its start and end are the same timestamp. With 15-minute intervals you may prefer to count that as 15 minutes, which would mean adding 15 minutes to every duration.

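    library(dplyr)  # provides %>%, mutate(), pull() and between()

    get_duration <- function(df, condition){
      # Evaluate the condition string against df's columns, then
      # run-length encode the resulting TRUE/FALSE vector
      rl_var <- df %>% 
        mutate(tf = eval(parse(text = condition))) %>% 
        pull(tf) %>%  
        rle()
      
      # Row index of the last row of each TRUE run, and each run's length
      i <- cumsum(rl_var$lengths)[rl_var$values]
      l <- rl_var$lengths[rl_var$values]
      
      starts <- i - l + 1
      ends <- i
      
      result <- data.frame(
        start = df$DateTime[starts],
        end = df$DateTime[ends],
        # Force minutes; a bare subtraction lets R pick the units
        duration = as.numeric(difftime(df$DateTime[ends], df$DateTime[starts], units = "mins"))
      )
      return(result)
    }
    
    condition1 <- "Variable1 > 28.3"
    df %>% mutate(condition1 = eval(parse(text = condition1)))  # optional: inspect the flag column
    get_duration(df, condition1)
    
    # Note: between() includes both bounds, unlike the strict > and < used in the plot
    condition2 <- "between(Variable2, 335, 390)"
    df %>% mutate(condition2 = eval(parse(text = condition2)))  # optional: inspect the flag column
    get_duration(df, condition2)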
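
If you would rather have the table match the shaded rectangles exactly, here is a minimal base-R sketch of the same rle() idea. It is an alternative under stated assumptions, not part of the function above: `highlight_runs` and `pad_secs` are hypothetical names, the condition is passed as a logical vector instead of a string, and `pad_secs = 450` mirrors the `DateTime-450` / `DateTime+450` padding used in the `geom_rect()` calls, so a single highlighted reading counts as 15 minutes.

```r
# Sketch only: `highlight_runs` and `pad_secs` are illustrative names.
highlight_runs <- function(times, flag, pad_secs = 0) {
  flag[is.na(flag)] <- FALSE            # assumption: missing values are not highlighted
  r <- rle(flag)
  ends <- cumsum(r$lengths)[r$values]   # last row index of each TRUE run
  starts <- ends - r$lengths[r$values] + 1
  start <- times[starts] - pad_secs     # widen each run by the rectangle padding
  end <- times[ends] + pad_secs
  data.frame(
    start = start,
    end = end,
    duration_min = as.numeric(difftime(end, start, units = "mins"))
  )
}

# Small worked example: 15-minute readings, above 28.3 at rows 2-3 and row 5
times <- seq(as.POSIXct("2024-05-01 00:00:00", tz = "UTC"),
             by = "15 min", length.out = 6)
values <- c(20, 30, 31, 25, 29, 27)
runs <- highlight_runs(times, values > 28.3, pad_secs = 450)
print(runs)
```

With the question's data this would be called as `highlight_runs(df$DateTime, df$Variable1 > 28.3, pad_secs = 450)` and `highlight_runs(df$DateTime, df$Variable2 > 335 & df$Variable2 < 390, pad_secs = 450)`. Passing a logical vector avoids `eval(parse())`, and leaving `pad_secs` at 0 gives the same start and end rows as `get_duration()`.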