rfunctiontimeiterationcalculation

Find max time difference within each year in R


I have a function that calculates the average, min and max values for each year in my dataframe, then merges them to output the alltime average, min and max values. Each year needs to be calculated separately first because my dates only refer to the months of April through August. If I didn't group it by year, there would be calculations between August of one year and April of the next year. I want to avoid this.

Example dataframe:

date            NDVI        cloud_cover    field_id

23/04/2017      0.6494          12           KM60        
23/04/2017      0.5683          0            KM1
05/05/2017      0.3467          0            KM60
31/07/2017      0.6743          05           KM60
31/07/2017        NA            97           KM1
31/07/2017      0.3456          07           LM27
01/04/2018        NA            100          KM60
03/06/2018      0.6743          11           KM60
03/06/2018      0.2346          12           KM1
04/05/2019        NA            99           KM60
05/05/2019      0.5432          20           KM60

NDVI and cloud_cover shouldn't influence calculations. Although field_ids most times provide the same dates, this also shouldn't influence them.

This is the current code:

calculate_time_diff <- function(df) {

  # Convert "date" column to datetime
  df$date <- as.POSIXct(df$date)
  
  # Group the data by year
  df_calc <- split(df, format(df$date, "%Y"))
  
  # Calculate time differences between consecutive observations for each year
  time_diffs <- lapply(df_calc, function(group) {
    # Sort dataframe based on "date"
    group <- group[order(group$date), ]
    
    # Filter out duplicate dates
    group <- group[!duplicated(group$date), ]
    
    # Calculate time differences between consecutive observations
    diff(group$date)
  })
  
  # Combine time differences from all years into a single vector
  all_time_diffs <- unlist(time_diffs)
  
  # Compute average time difference
  avg_time_diff <- mean(all_time_diffs)
  
  # Calculate smallest and biggest time differences
  smallest_time_diff <- min(all_time_diffs)
  biggest_time_diff <- max(all_time_diffs)
  
  return(list(avg_time_diff = avg_time_diff,
              smallest_time_diff = smallest_time_diff,
              biggest_time_diff = biggest_time_diff))
}

The output is giving me "240" as max time difference, which I know to be unrealistic. My dataframe refers to the revisit dates of three satellites and none of them should be more than at the very most a month apart.

I thought it could have something to do with the way years are being extracted, but this user seems to have successfully used format() just as I did. lapply() should iterate through each split year group in the same way as group_by(). So what could be the problem in my script?


Solution

  • Using dplyr:

    data %>%
      distinct(date) %>% #remove duplicates
      arrange(date) %>% #order by date
      group_by(format(date, "%Y")) %>% #group by year
      reframe(dateDiff = diff(date)) %>% #apply 'diff' to every group
      with(list(avg_time_diff = mean(dateDiff),
                smallest_time_diff = min(dateDiff),
                biggest_time_diff = max(dateDiff))) #create your metrics
    

    Result:

    $avg_time_diff
    Time difference of 30.02198 days
    
    $smallest_time_diff
    Time difference of 12 days
    
    $biggest_time_diff
    Time difference of 51 days
    

    Dummy data:

    data <- data.frame(date = seq(as.Date("2017-01-01"), by = "month", length.out = 100) + sample(0:20, 100, TRUE))