rdplyrtidyverse

How to count number of instances over a threshold value between two dates?


So I have two data frames I am working with. The first contains information on a cohort of kids. These same kids have been measured multiple times on different days (so uniqueID AD_1 indicates child AD receiving their first set of measurements, and AD_2 indicates the same child receiving their second set of measurements). We are interested in the date windows between these measurement days, and have made columns start_date and end_date to define the windows, like so:

uniqueID start_date end_date
AD_1 2/22/2023 2/24/2023
KJ_1 2/22/2023 2/26/2023
AD_2 2/24/2023 3/8/2023
KJ_2 2/26/2023 3/2/2023
AD_3 3/8/2023 3/10/2023
KJ_3 3/2/2023 3/9/2023

The second data frame contains a continuous range of dates, and maximum daily temperatures for each date:

date maxtemp
2/22/2023 23.3
2/23/2023 18.8
2/24/2023 25.2
2/25/2023 27.0
2/26/2023 19.1
2/27/2023 20.2
2/28/2023 20.7
3/1/2023 26.1
3/2/2023 27.2
3/3/2023 20.8
3/4/2023 22.4
3/5/2023 21.8
3/6/2023 23.9
3/7/2023 24.9
3/8/2023 25.1
3/9/2023 22.7
3/10/2023 25.5

What I need to be able to do is count the number of days over 25° for each growth window (i.e., between each start and end date) as specified in the first data frame, using the temperature data in the second data frame.

I attempted to do this using the following code:

df1 %>% 
  transmute(uniqueID, day = map2(start_date, end_date, seq, by = "1 day")) %>%
  group_by(uniqueID) %>%
  summarise(count = df2$maxtemp >= 25)

However, this seemed to just generate a massive list of T/F values based on whether each day in the date window had a maximum temperature over 25°. Is there a way to have it add up the occurrences of days over 25°, to generate a new column in the first data frame, like this?

uniqueID start_date end_date daysover25
AD_1 2/22/2023 2/24/2023 1
KJ_1 2/22/2023 2/26/2023 2
AD_2 2/24/2023 3/8/2023 5
KJ_2 2/26/2023 3/2/2023 2
AD_3 3/8/2023 3/10/2023 1
KJ_3 3/2/2023 3/9/2023 2

Solution

  • You can try

    df1 %>%
      left_join(
        df2 %>%
          filter(maxtemp >= 25),
        by = join_by(start_date <= date, end_date >= date)
      ) %>%
      reframe(daysover25 = n(), .by = uniqueID:end_date)
    

    which gives

      uniqueID start_date   end_date daysover25
    1     AD_1 2023-02-22 2023-02-24          1
    2     KJ_1 2023-02-22 2023-02-26          2
    3     AD_2 2023-02-24 2023-03-08          5
    4     KJ_2 2023-02-26 2023-03-02          2
    5     AD_3 2023-03-08 2023-03-10          2
    6     KJ_3 2023-03-02 2023-03-09          2
    

    data

    > dput(df1)
    structure(list(uniqueID = c("AD_1", "KJ_1", "AD_2", "KJ_2", "AD_3",
    "KJ_3"), start_date = structure(c(19410, 19410, 19412, 19414,
    19424, 19418), class = "Date"), end_date = structure(c(19412,
    19414, 19424, 19418, 19426, 19425), class = "Date")), class = "data.frame", row.names = c(NA,
    -6L))
    
    > dput(df2)
    structure(list(date = structure(c(19410, 19411, 19412, 19413,
    19414, 19415, 19416, 19417, 19418, 19419, 19420, 19421, 19422,
    19423, 19424, 19425, 19426), class = "Date"), maxtemp = c(23.3,
    18.8, 25.2, 27, 19.1, 20.2, 20.7, 26.1, 27.2, 20.8, 22.4, 21.8,
    23.9, 24.9, 25.1, 22.7, 25.5)), class = "data.frame", row.names = c(NA,
    -17L))