rgroup-bykeytidyverselong-format-data

Fetching values from one column based on other column keys in long-formatted dataset


I have a long format dataset of 100,000+ individuals, capturing clinic visits at 5 different time points (not chronological). I've included an example dataset below that replicates the formatting of my data:

I would like to create a new column (clinic_number_extracted) that fetches the clinic identifier (clinic_number) corresponding to each sorted age (age_sorted value) for each participant. I was thinking that it might be possible to use the age_sorted_index and visit_number variables to do so (generating key-value pairs?), but am not quite sure how to do this outside of data.table. A tidyverse solution would be preferred.

I've looked on R community and stack exchange for clues, but haven't been able to find exactly what I'm looking for (likely not using the correct search terms). I tried to play around with group_by(across()) and with_order(order_by()) functions without much success. I can potentially create a new variable with a few case_when() conditions but might run into issues if there are repeated age_assessment_sorted values.

set.seed(42)

# Beginning dataset
das <- data.frame(id = rep(letters[1:3], each = 5), 
                  visit_number = rep(1:5, times = 3),
                  age_visit = c(50, rep(NA_real_, times = 7), 34, 40, 72, rep(NA_real_, times = 3), 87), 
                  clinic_number = sample(30:50, 15, replace=TRUE), 
                  age_sorted = c(50, rep(NA_real_, times = 4), 34, 40,rep(NA_real_, times = 3), 72, 87, rep(NA_real_, times = 3)), 
                  age_sorted_index = c(rep(1:5), 4, 5, rep(1:3), 1, 5, 2, 3, 4)) 

# Print out dataset
das
#>    id visit_number age_visit clinic_number age_sorted age_sorted_index
#> 1   a            1        50            46         50                1
#> 2   a            2        NA            34         NA                2
#> 3   a            3        NA            30         NA                3
#> 4   a            4        NA            39         NA                4
#> 5   a            5        NA            33         NA                5
#> 6   b            1        NA            47         34                4
#> 7   b            2        NA            46         40                5
#> 8   b            3        NA            44         NA                1
#> 9   b            4        34            36         NA                2
#> 10  b            5        40            33         NA                3
#> 11  c            1        72            34         72                1
#> 12  c            2        NA            43         87                5
#> 13  c            3        NA            49         NA                2
#> 14  c            4        NA            47         NA                3
#> 15  c            5        87            44         NA                4

Desired data:

das_final <- cbind(das, 
                  clinic_number_extracted = c(46, rep(NA_real_, times = 4), 36, 33, rep(NA_real_, times = 3), 34, 44, rep(NA_real_, times = 3)))

# Print out final dataset
das_final 
#>    id visit_number age_visit clinic_number age_sorted age_sorted_index
#> 1   a            1        50            46         50                1
#> 2   a            2        NA            34         NA                2
#> 3   a            3        NA            30         NA                3
#> 4   a            4        NA            39         NA                4
#> 5   a            5        NA            33         NA                5
#> 6   b            1        NA            47         34                4
#> 7   b            2        NA            46         40                5
#> 8   b            3        NA            44         NA                1
#> 9   b            4        34            36         NA                2
#> 10  b            5        40            33         NA                3
#> 11  c            1        72            34         72                1
#> 12  c            2        NA            43         87                5
#> 13  c            3        NA            49         NA                2
#> 14  c            4        NA            47         NA                3
#> 15  c            5        87            44         NA                4
#>    clinic_number_extracted
#> 1                       46
#> 2                       NA
#> 3                       NA
#> 4                       NA
#> 5                       NA
#> 6                       36
#> 7                       33
#> 8                       NA
#> 9                       NA
#> 10                      NA
#> 11                      34
#> 12                      44
#> 13                      NA
#> 14                      NA
#> 15                      NA

Created on 2022-05-06 by the reprex package (v2.0.1)


Solution

  • We may use match between the 'age_sorted' and 'age_visit', use that index to subset the 'clinic_number' and replace the elements where they are NA in 'age_sorted' to NA

    library(dplyr)
    das_final %>% 
      group_by(id) %>%
      mutate(clinic_new = clinic_number[match(age_sorted, age_visit)] * 
          NA^is.na(age_sorted)) %>%
      ungroup
    

    -output

    # A tibble: 15 × 8
       id    visit_number age_visit clinic_number age_sorted age_sorted_index clinic_number_extracted clinic_new
       <chr>        <int>     <dbl>         <int>      <dbl>            <dbl>                   <dbl>      <dbl>
     1 a                1        50            46         50                1                      46         46
     2 a                2        NA            34         NA                2                      NA         NA
     3 a                3        NA            30         NA                3                      NA         NA
     4 a                4        NA            39         NA                4                      NA         NA
     5 a                5        NA            33         NA                5                      NA         NA
     6 b                1        NA            47         34                4                      36         36
     7 b                2        NA            46         40                5                      33         33
     8 b                3        NA            44         NA                1                      NA         NA
     9 b                4        34            36         NA                2                      NA         NA
    10 b                5        40            33         NA                3                      NA         NA
    11 c                1        72            34         72                1                      34         34
    12 c                2        NA            43         87                5                      44         44
    13 c                3        NA            49         NA                2                      NA         NA
    14 c                4        NA            47         NA                3                      NA         NA
    15 c                5        87            44         NA                4                      NA         NA