I am working with water quality data in which I have sampling location, the dates of the samples taken and the result of each sample. Whenever a sample result is greater than 104, it triggers an advisory. The single advisory is counted once but lasts as many days it takes until new sample data reflects a result equal to or lower than 104. The length of the advisory counts starts with the first day the advisory was triggered and counts up to, but does not include, the date the result is equal to or lower than 104. I am looking for a method to count the total number of advisories per location and the average length of an advisory (in days). Thanks!!
There are a few steps you need to make here. Firstly, your dates appear to be stored as a character vector rather than actual date objects, so you need to convert these. Next, you need to ensure that the readings are arranged in order for each site.
You will need to include a column that tells you when the next reading for each site was made. Then, you need to find the consecutive readings at which the values are greater than 104. You can then group these together using consecutive_ids
. Finally, you can filter out the non-advisory readings and summarize the data frame to give only the minimum date and maximum 'next date' of each group of advisory readings. It is trivial to then summarize this into a data frame of counts and mean advisory lengths.
library(tidyverse)
df %>%
mutate(DATE = lubridate::mdy(DATE)) %>%
arrange(SITE_ID, DATE) %>%
group_by(SITE_ID) %>%
mutate(advisory = RESULT >= 104) %>%
mutate(next_reading = lead(DATE)) %>%
mutate(ad_block = consecutive_id(advisory)) %>%
filter(advisory) %>%
group_by(SITE_ID, ad_block) %>%
summarize(start = min(DATE), stop = max(next_reading), .groups = 'drop') %>%
mutate(length = as.numeric(difftime(stop, start, unit = 'days'))) %>%
summarise(advisories = n(),
mean_duration = mean(length, na.rm = TRUE),
.by = "SITE_ID")
#> # A tibble: 3 x 3
#> SITE_ID advisories mean_duration
#> <chr> <int> <dbl>
#> 1 ARAA01 2 17.5
#> 2 BRAA01 2 23
#> 3 CRAA01 3 10.3
Note that the site BRAA01 ends on an advisory if we order the dates properly. We do not know how long this advisory was, so we have only been able to use the first advisory period at that site in the calculation.
Data transcribed from image in question
Anyone that wants to answer this question will have to transcribe your picture into useable text, which is why we ask you not to post images of data, but rather include reproducible data in questions. I have used OCR to transcribe your data on this occasion.
df <- data.frame(
SITE_ID = c("ARAA01", "ARAA01", "ARAA01", "ARAA01", "ARAA01", "ARAA01",
"ARAA01", "BRAA01", "BRAA01", "BRAA01", "BRAA01", "BRAA01",
"BRAA01", "BRAA01", "CRAA01", "CRAA01", "CRAA01", "CRAA01",
"CRAA01", "CRAA01", "CRAA01", "CRAA01", "CRAA01"),
RESULT = c(13, 107, 10, 115, 120, 110, 80, 5, 5, 112, 105, 5, 118, 5, 154,
180, 2000, 1543, 103, 180, 80, 112, 5),
DATE = c("1/3/2023", "1/17/2023", "1/30/2023", "2/13/2023", "2/27/2023",
"3/6/2023", "3/7/2023", "1/3/2023", "1/18/2023", "1/25/2023",
"2/7/2023", "2/17/2023", "3/28/2023", "3/7/2023", "1/2/2023",
"1/5/2023", "1/6/2023", "1/7/2023", "1/9/2023", "2/6/2023",
"2/17/2023", "2/21/2023", "3/6/2023"))