rdatabricksdifftime

Last value manually inputted in an r script difftime output


R coding in Databricks.

I want the time intervals in hours between date1 entries, arranged by pid, med and date1.

I want the latest date1 entry in a sequence of events per date to be manually adjustable to 24 hours.

A cohort is when pid, med and date1 is the same.
Any change will terminate the last cohort with an hour_output == 24.

df

  pid       med                date1                    
1  1       drugA             2023-02-02 09:00:00         
2  1       drugA             2023-02-02 12:00:00         
3  1       drugA             2023-02-02 14:00:00        
4  1       drugB             2023-02-03 10:00:00         
5  1       drugB             2023-02-03 18:00:00 

What script was attempted.

df1 <- df %>%
  arrange(pid, med, date1) %>%
  mutate(hours_output = as.numeric(difftime(lead(date1), date1, units = "hours")))

# Replace the last duration value with 24 hours
df1$hours_output[last(nrow(df1))] <- 24

df1 <- df1 %>% select(med, date1, hours_output)
head(df1)

Actual output

  pid       med                date1                    hours_output
1  1       drugA             2023-02-02 09:00:00         3.00
2  1       drugA             2023-02-02 12:00:00         2.00
3  1       drugA             2023-02-02 14:00:00        20.00
4  1       drugB             2023-02-03 10:00:00         8.00
5  1       drugB             2023-02-03 18:00:00        18.00 (18 hours to the next row - not shown)

Desired output

  pid       med                date1                    hours_output
1  1       drugA             2023-02-02 09:00:00         3.00
2  1       drugA             2023-02-02 12:00:00         2.00
3  1       drugA             2023-02-02 14:00:00        24.00
4  1       drugB             2023-02-03 10:00:00         8.00
5  1       drugB             2023-02-03 18:00:00        24.00

Solution

  • The following works on the dummy data (as at top of question) when used in Databricks.

    Gives desired output.

    (Also, works on actual data in Databricks)

    library(dplyr)
    library(lubridate)
    
    # Convert datetime column to POSIXct object
    df$date1 <- ymd_hms(df$date1)
    
    df <- df %>% arrange(date1)
    
    # Calculate duration between consecutive datetime values, including last interval
    durations <- c(diff(df$date1), 0)
    
    # Convert durations to hours and round to 2 decimal places
    durations <- round(as.numeric(durations, units = "hours"), 2)
    
    # Replace any negative values with 0
    durations[durations < 0] <- 0
    
    # Find last timestamp for each date and replace duration with 24 hours
    last_times <- dc_4 %>%
      group_by(Date = as.Date(date1)) %>%
      slice_tail(n = 1) %>%
      ungroup()
    
    durations[df$date1 %in% last_times$date1] <- 24
    
    df$duration <- durations
    
    df1 <- df %>% select(date1, duration)
    
    head(df1, 10)