Tags: r, data-wrangling, data-preprocessing

Data wrangling: Getting data within one year of a time point


I am trying to use R (tidyverse) to perform something like the following:

Suppose I have a data set with a subject ID, Visit Code, (say, 1 to 8 visits), Date of visit, demographics (age, sex, etc.), and two test results (test A and test B).

Test A was administered from the beginning of the study, but not necessarily at every visit. Test B began later (most commonly at Visit 5, but for some people at other visits).

I want to make a cross-sectional dataset corresponding to the first time each person took test B (for most people that will be at Visit 5, but for others it will be another visit). I also want everyone to have a test A score from within one year (+/-) of when test B was done. Many people will have a test A score from the same visit as their first test B, but some won't have a test A score at that visit; for them, I want to take the nearest test A score from another visit, and only if it was within a year of the test B score.

I see people publishing work using this approach, but can you help me figure out how to code this to get this type of dataset from a master dataset?

To clarify the question, I put together a simple example (along with the desired output I am hoping to get):

mydata <- data.frame(
  Id      = c(1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3),
  VISIT   = c(1,2,3,4,5,6,7,8, 1,2,3,4,5,6,7,8, 1,2,3,4,5,6,7,8),
  Time    = c(0,1.1,1.9,3,4,5.1,6.1,6.9, 0,.9,2.1,3.1,4.1,5,6.1,7.2, 0,1,2.1,3.2,3.9,5.1,6,7.1),
  Score_A = c(10,9,9,8,7,10,10,8, 5,9,4,3,NA,13,14,18, 9,9,10,11,NA,14,12,13),
  Score_B = c(NA,NA,NA,NA,100,NA,90,NA, NA,NA,NA,NA,80,NA,99,NA, NA,NA,NA,NA,75,NA,97,NA)
)

mydata

desired_output <- data.frame(Id=c(1,2,3), Score_A=c(7,13,11), Score_B=c(100,80,75))

I tried the following, but it doesn't account for the +/- 1 year window, so Person 2 ends up with an NA for Score_A:

library(dplyr)

Q <- mydata %>%
  group_by(Id) %>%
  arrange(Time, .by_group = TRUE) %>%
  filter(!is.na(Score_B)) %>%
  slice(1)

Thanks!


Solution

  • Here is a data.table approach using a rolling join. It could also be done with dplyr joins (take a look at the help file for join_by(); a sketch of that route is at the end of this answer), but I am more familiar with data.table.

    Update

    library(data.table)
    # convert to data.table format (modifies mydata by reference)
    setDT(mydata)
    # each Id's first row with a non-missing Score_B
    firstB <- mydata[!is.na(Score_B), .SD[1], by = .(Id)]
    # all rows that have a Score_A value
    scoreA <- mydata[!is.na(Score_A), ]
    # rolling join on the nearest VISIT that has a Score_A
    firstB[, c("Score_A", "Time_A") := scoreA[firstB, .(Score_A, Time), on = .(Id, VISIT), roll = "nearest"]]
    # select wanted columns; keeping both times makes it possible to check
    # that the A and B measurements are within 1 year of each other
    firstB[, .(Id, Time_A, Score_A, Time_B = Time, Score_B)]
    

    output"

       Id Time_A Score_A Time_B Score_B
    1:  1    4.0       7    4.0     100
    2:  2    3.1       3    4.1      80
    3:  3    3.2      11    3.9      75
    
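    The "+/- 1 year" check mentioned in the last comment is not spelled out above; a minimal sketch of one way to add it (the result name is mine, not part of the original answer) is to blank out any Score_A measured more than a year from its Score_B:

    # hypothetical follow-up: drop Score_A values outside the +/- 1 year window
    result <- firstB[, .(Id, Time_A, Score_A, Time_B = Time, Score_B)]
    result[abs(Time_A - Time_B) > 1, Score_A := NA]

    On the example data all three subjects already fall inside the window (subject 2 is exactly 1.0 apart), so nothing changes here.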

    Old answer

    library(data.table)
    # convert to data.table format (modifies mydata by reference)
    setDT(mydata)
    # each Id's first row with a non-missing Score_B
    firstB <- mydata[!is.na(Score_B), .SD[1], by = .(Id)]
    # all rows that have a Score_A value
    scoreA <- mydata[!is.na(Score_A), ]
    # rolling join: if the visit itself has no Score_A, take it from the
    # next visit, looking forward at most one VISIT (roll = -1)
    firstB[, Score_A := scoreA[firstB, Score_A, on = .(Id, VISIT), roll = -1]]
    # select wanted columns
    firstB[, .(Id, Score_A, Score_B)]
    #    Id Score_A Score_B
    # 1:  1       7     100
    # 2:  2      13      80
    # 3:  3      14      75
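
    As mentioned at the top, the same idea can be expressed with dplyr joins. Below is a minimal sketch, assuming dplyr >= 1.1.0 (for join_by() and its between() helper); the names b_first, a_scores, Time_A, Time_B, lo and hi are mine, not part of the original answer. Because join_by() conditions must reference plain columns rather than expressions, the window bounds are precomputed as lo/hi.

    library(dplyr)

    # first Score_B row per subject, plus the +/- 1 year window bounds
    b_first <- mydata %>%
      filter(!is.na(Score_B)) %>%
      group_by(Id) %>%
      slice_min(Time, n = 1, with_ties = FALSE) %>%
      ungroup() %>%
      transmute(Id, Time_B = Time, Score_B, lo = Time_B - 1, hi = Time_B + 1)

    # all Score_A measurements
    a_scores <- mydata %>%
      filter(!is.na(Score_A)) %>%
      select(Id, Time_A = Time, Score_A)

    # inequality join keeps every Score_A inside the window;
    # slice_min() then keeps the one closest in time to Score_B
    b_first %>%
      left_join(a_scores, join_by(Id, between(y$Time_A, x$lo, x$hi))) %>%
      group_by(Id) %>%
      slice_min(abs(Time_A - Time_B), n = 1, with_ties = FALSE) %>%
      ungroup() %>%
      select(Id, Score_A, Score_B)

    On the example data this reproduces the asker's desired_output (Score_A of 7, 13 and 11), since nearness is measured on Time rather than on VISIT number: for subject 2, Visits 4 and 6 are equally far by visit count, but Visit 6's Score_A is closer in time to the first Score_B.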