I'm working with a dataset where each row represents where an individual used services. This implicitly also tracks whether someone uses services, because if they did not, the column value for the month is NA
. I want to identify cases where there are breaks (periods of absence followed by return) in an individual's service based on transitions between consecutive month columns.
In other words, I am specifically interested in identifying when someone goes from using services (there are values in a month's column) to not using services (the value of 1+ subsequent month is NA
) and then back to using services (values in 1+ subsequent month columns, preceded by NA
s). My hope is to have a binary TRUE/FALSE
column ("Breaks_in_Service"). When someone begins services does not matter (meaning, NA
s only matter after the first month-column with a value in it).
Here's a simplified version of my dataset:
# Sample Data
simp_2021 <- data.frame(
ID = c(1, 2, 3, 4, 5),
jan21_ORG_NAME = c("Org A", NA, NA, "Org B", "Org B"),
feb21_ORG_NAME = c(NA, "Org A", "Org B", NA, "Org B"),
mar21_ORG_NAME = c(NA, NA, "Org B", "Org D", NA),
apr21_ORG_NAME = c("Org B", NA, "Org C", NA, "Org E")
)
# Initialize Breaks_in_Service column as FALSE
simp_2021$Breaks_in_Service <- FALSE
# View
print(simp_2021)
Expected output: In the sample data, Breaks_in_Service should be TRUE
for IDs 1, 4, and 5, and FALSE
for IDs 2 and 3.
I've tried building a for
loop, but it's gotten messy and isn't working:
# Loop over each row to check for breaks in service
for (i in 1:nrow(simp_2021)) {
row_values <- simp_2021[i, 2:ncol(simp_2021)] # Extract service columns for the current row
# Initialize flags to track service usage
in_service <- FALSE
found_break <- FALSE
# Check transitions within the row
for (j in 1:(length(row_values) - 1)) {
current_value <- row_values[[j]]
next_value <- row_values[[j + 1]]
if (is.na(current_value) && !is.na(next_value)) {
# Transition from not using service to using service
in_service <- TRUE
} else if (!is.na(current_value) && is.na(next_value)) {
# Transition from using service to not using service
if (in_service) {
found_break <- TRUE
break # Found a break, no need to check further
}
}
}
# Set Breaks_in_Service based on found breaks
if (found_break) {
simp_2021$Breaks_in_Service[i] <- TRUE
}
}
# View the updated dataframe with the new 'Breaks_in_Service' column
print(simp_2021)
You could, first, equal the month columns to binary !NA
using +!is.na()
and paste
to binary strings, next sub
away all starting 0
s, finally grepl
for the pattern 01
.
> (tmp <- apply(+!is.na(simp_2021[-1]), 1, paste, collapse=''))
[1] "1001" "0100" "0111" "1010" "1101"
> (tmp <- sub(tmp, pat='^0+', rep=''))
[1] "1001" "100" "111" "1010" "1101"
> (tmp <- grepl(tmp, pat='01'))
[1] TRUE FALSE FALSE TRUE TRUE
Altogether in a nice pipe:
> transform(simp_2021,
+ Breaks_in_Service=apply(+!is.na(simp_2021[-1]), 1, paste, collapse='') |>
+ sub(pat='^0+', rep='') |>
+ grepl(pat='01'))
ID jan21_ORG_NAME feb21_ORG_NAME mar21_ORG_NAME apr21_ORG_NAME Breaks_in_Service
1 1 Org A <NA> <NA> Org B TRUE
2 2 <NA> Org A <NA> <NA> FALSE
3 3 <NA> Org B Org B Org C FALSE
4 4 Org B <NA> Org D <NA> TRUE
5 5 Org B Org B <NA> Org E TRUE
Note: this approach also uses row-wise apply
, but on a "matrix"
for which it is designed for and is therefore efficient.
> is.matrix(+!is.na(simp_2021[-1]))
[1] TRUE
Data:
> dput(simp_2021)
structure(list(ID = c(1, 2, 3, 4, 5), jan21_ORG_NAME = c("Org A",
NA, NA, "Org B", "Org B"), feb21_ORG_NAME = c(NA, "Org A", "Org B",
NA, "Org B"), mar21_ORG_NAME = c(NA, NA, "Org B", "Org D", NA
), apr21_ORG_NAME = c("Org B", NA, "Org C", NA, "Org E")), class = "data.frame", row.names = c(NA,
-5L))