I'm hoping to determine when the values in a set of columns are redundant, documenting it in a new column multi?
where 0 means only a single value is seen, 1 means multiple values are seen. When the value "Unspecified"
is with other values, I'd like the code to ignore it and assess the redundancy of other values accordingly. When the value "Unspecified"
is the only value in the set of columns, I would like the column multi?
to document "Unspecified"
.
Of note, these four columns are just a segment of a much larger database with many more columns.
To illustrate what I mean, I have provided an example input and output below:
headbleed_type_dx1 headbleed_type_dx2 headbleed_type_dx3 headbleed_type_dx4
1 Intracerebral Intracerebral Intracerebral <NA>
2 Intracerebral Subarachnoid <NA> Subdural
3 Unspecified Intracerebral Subdural Intracerebral
4 Unspecified <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA>
If the row was 1 for Multi?
, then I'd also like to document the number of unique values in the new column Number
Multi? Number
1 0 1
2 1 3
3 1 2
4 Unspecified 1
5 NA NA
That's really cumbersome and I would really advise against mixing numbers and characters in a column. Having said that and if you are open to a dplyr
based solution
library(dplyr)
data %>%
rowwise() %>%
summarise(
number = n_distinct(
c_across(headbleed_type_dx1:headbleed_type_dx4),
na.rm = TRUE),
unspec = coalesce(
any(c_across(headbleed_type_dx1:headbleed_type_dx4) == "Unspecified"),
FALSE)) %>%
mutate(
number2 = if_else(number > 1L & unspec, number - 1L, na_if(number, 0)),
multi = case_when(number == 1 & unspec ~ "Unspecific",
number2 == 1 ~ "0",
is.na(number2) ~ NA_character_,
TRUE ~ "1"),
.keep = "none") %>%
select(number = number2, multi)
This returns
# A tibble: 6 × 2
number multi
<int> <chr>
1 1 0
2 3 1
3 2 1
4 1 Unspecific
5 NA NA
6 1 0
structure(list(headbleed_type_dx1 = c("Intracerebral", "Intracerebral",
"Unspecified", "Unspecified", NA, "Intracerebral"), headbleed_type_dx2 = c("Intracerebral",
"Subarachnoid", "Intracerebral", NA, NA, "Unspecified"), headbleed_type_dx3 = c("Intracerebral",
NA, "Subdural", NA, NA, "Intracerebral"), headbleed_type_dx4 = c(NA,
"Subdural", "Intracerebral", NA, NA, NA)), problems = structure(list(
row = 1:4, col = c(NA_character_, NA_character_, NA_character_,
NA_character_), expected = c("4 columns", "4 columns", "4 columns",
"4 columns"), actual = c("5 columns", "5 columns", "5 columns",
"5 columns"), file = c("literal data", "literal data", "literal data",
"literal data")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(headbleed_type_dx1 = structure(list(), class = c("collector_character",
"collector")), headbleed_type_dx2 = structure(list(), class = c("collector_character",
"collector")), headbleed_type_dx3 = structure(list(), class = c("collector_character",
"collector")), headbleed_type_dx4 = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))