I have a long format dataset of 100,000+ individuals, capturing clinic visits at 5 different time points (not chronological). I've included an example dataset below that replicates the formatting of my data:
ID
: participant ID visit_number: order of the clinic visits in the
original dataset
age_visit
: age at the time of the visit
clinic_number
: the identifier for the specific clinic location
age_sorted
: For each ID, age sorted in ascending order across the 5
clinic visits
age_sorted_index
: For each ID, the visit number
corresponding to the sorted age
I would like to create a new column (clinic_number_extracted
) that fetches the clinic identifier (clinic_number
) corresponding to each sorted age (age_sorted
value) for each participant. I was thinking that it might be possible to use the age_sorted_index
and visit_number
variables to do so (generating key-value pairs?), but am not quite sure how to do this outside of data.table
. A tidyverse solution would be preferred.
I've looked on R community and stack exchange for clues, but haven't been able to find exactly what I'm looking for (likely not using the correct search terms). I tried to play around with group_by(across())
and with_order(order_by())
functions without much success. I can potentially create a new variable with a few case_when()
conditions but might run into issues if there are repeated age_assessment_sorted values
.
set.seed(42)
# Beginning dataset
das <- data.frame(id = rep(letters[1:3], each = 5),
visit_number = rep(1:5, times = 3),
age_visit = c(50, rep(NA_real_, times = 7), 34, 40, 72, rep(NA_real_, times = 3), 87),
clinic_number = sample(30:50, 15, replace=TRUE),
age_sorted = c(50, rep(NA_real_, times = 4), 34, 40,rep(NA_real_, times = 3), 72, 87, rep(NA_real_, times = 3)),
age_sorted_index = c(rep(1:5), 4, 5, rep(1:3), 1, 5, 2, 3, 4))
# Print out dataset
das
#> id visit_number age_visit clinic_number age_sorted age_sorted_index
#> 1 a 1 50 46 50 1
#> 2 a 2 NA 34 NA 2
#> 3 a 3 NA 30 NA 3
#> 4 a 4 NA 39 NA 4
#> 5 a 5 NA 33 NA 5
#> 6 b 1 NA 47 34 4
#> 7 b 2 NA 46 40 5
#> 8 b 3 NA 44 NA 1
#> 9 b 4 34 36 NA 2
#> 10 b 5 40 33 NA 3
#> 11 c 1 72 34 72 1
#> 12 c 2 NA 43 87 5
#> 13 c 3 NA 49 NA 2
#> 14 c 4 NA 47 NA 3
#> 15 c 5 87 44 NA 4
Desired data:
das_final <- cbind(das,
clinic_number_extracted = c(46, rep(NA_real_, times = 4), 36, 33, rep(NA_real_, times = 3), 34, 44, rep(NA_real_, times = 3)))
# Print out final dataset
das_final
#> id visit_number age_visit clinic_number age_sorted age_sorted_index
#> 1 a 1 50 46 50 1
#> 2 a 2 NA 34 NA 2
#> 3 a 3 NA 30 NA 3
#> 4 a 4 NA 39 NA 4
#> 5 a 5 NA 33 NA 5
#> 6 b 1 NA 47 34 4
#> 7 b 2 NA 46 40 5
#> 8 b 3 NA 44 NA 1
#> 9 b 4 34 36 NA 2
#> 10 b 5 40 33 NA 3
#> 11 c 1 72 34 72 1
#> 12 c 2 NA 43 87 5
#> 13 c 3 NA 49 NA 2
#> 14 c 4 NA 47 NA 3
#> 15 c 5 87 44 NA 4
#> clinic_number_extracted
#> 1 46
#> 2 NA
#> 3 NA
#> 4 NA
#> 5 NA
#> 6 36
#> 7 33
#> 8 NA
#> 9 NA
#> 10 NA
#> 11 34
#> 12 44
#> 13 NA
#> 14 NA
#> 15 NA
Created on 2022-05-06 by the reprex package (v2.0.1)
We may use match
between the 'age_sorted' and 'age_visit', use that index to subset the 'clinic_number' and replace the elements where they are NA in 'age_sorted' to NA
library(dplyr)
das_final %>%
group_by(id) %>%
mutate(clinic_new = clinic_number[match(age_sorted, age_visit)] *
NA^is.na(age_sorted)) %>%
ungroup
-output
# A tibble: 15 × 8
id visit_number age_visit clinic_number age_sorted age_sorted_index clinic_number_extracted clinic_new
<chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 a 1 50 46 50 1 46 46
2 a 2 NA 34 NA 2 NA NA
3 a 3 NA 30 NA 3 NA NA
4 a 4 NA 39 NA 4 NA NA
5 a 5 NA 33 NA 5 NA NA
6 b 1 NA 47 34 4 36 36
7 b 2 NA 46 40 5 33 33
8 b 3 NA 44 NA 1 NA NA
9 b 4 34 36 NA 2 NA NA
10 b 5 40 33 NA 3 NA NA
11 c 1 72 34 72 1 34 34
12 c 2 NA 43 87 5 44 44
13 c 3 NA 49 NA 2 NA NA
14 c 4 NA 47 NA 3 NA NA
15 c 5 87 44 NA 4 NA NA