I am trying to use R (tidyverse) to perform something like the following:
Suppose I have a data set with a subject ID, Visit Code, (say, 1 to 8 visits), Date of visit, demographics (age, sex, etc.), and two test results (test A and test B).
Test A has been administered since the beginning of the study, but not necessarily at every visit. Test B began later (most commonly at Visit 5, but for some people at other visits).
I want to make a cross-sectional dataset corresponding to the first time each person took test B (for most people that will be at Visit 5, but for others it will be another visit). I also want everyone to have a test A score from within 1 year (+/-) of when test B was done. Many people will have a test A score at the same visit as their first test B, but some don't have a test A score at that visit; for those, I want to take the test A score from the nearest other visit, but only if it is within a year of the test B score.
I see people publishing work using this approach, but can you help me figure out how to code this to get this type of dataset from a master dataset?
To clarify the question, I put together a simple example (along with the desired output I am hoping to get):
mydata <- data.frame(
  Id      = c(1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3),
  VISIT   = c(1,2,3,4,5,6,7,8, 1,2,3,4,5,6,7,8, 1,2,3,4,5,6,7,8),
  Time    = c(0,1.1,1.9,3,4,5.1,6.1,6.9, 0,.9,2.1,3.1,4.1,5,6.1,7.2, 0,1,2.1,3.2,3.9,5.1,6,7.1),
  Score_A = c(10,9,9,8,7,10,10,8, 5,9,4,3,NA,13,14,18, 9,9,10,11,NA,14,12,13),
  Score_B = c(NA,NA,NA,NA,100,NA,90,NA, NA,NA,NA,NA,80,NA,99,NA, NA,NA,NA,NA,75,NA,97,NA)
)
mydata
desired_output <- data.frame(Id=c(1,2,3), Score_A=c(7,13,11), Score_B=c(100,80,75))
I tried the following, but it's not accounting for the +/- 1 year, so Person 2 ends up with an NA for Score_A:
Q <- mydata %>%
  group_by(Id) %>%
  arrange(Time, .by_group = TRUE) %>%
  filter(!is.na(Score_B)) %>%
  slice(1)
Thanks!
Here is a data.table approach using a rolling join. It could also be done with dplyr joins (take a look at the help file for join_by()), but I am more familiar with data.table.
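For reference, here is a rough dplyr-only sketch of the same idea. It doesn't use join_by() or rolling semantics; it simply joins every Score_A candidate to the first Score_B per Id, keeps candidates within +/- 1 year (treating the window as inclusive is an assumption), and takes the nearest one:

library(dplyr)

# first non-missing Score_B per Id
first_B <- mydata %>%
  filter(!is.na(Score_B)) %>%
  group_by(Id) %>%
  slice_min(Time, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Id, Time_B = Time, Score_B)

# all non-missing Score_A values
score_A <- mydata %>%
  filter(!is.na(Score_A)) %>%
  select(Id, Time_A = Time, Score_A)

first_B %>%
  left_join(score_A, by = "Id") %>%                              # every Score_A candidate per Id
  filter(abs(Time_A - Time_B) <= 1) %>%                          # enforce the +/- 1 year window
  group_by(Id) %>%
  slice_min(abs(Time_A - Time_B), n = 1, with_ties = FALSE) %>%  # nearest in time
  ungroup() %>%
  select(Id, Time_A, Score_A, Time_B, Score_B)

Note that this drops any Id with no Score_A within a year of their first Score_B; whether that is the behaviour you want depends on the study.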
update
library(data.table)
# set to data.table format
setDT(mydata)
# get rows with first scoreB by Id
firstB <- mydata[!is.na(Score_B), .SD[1], by = .(Id)]
# all score_A values
scoreA <- mydata[!is.na(Score_A), ]
# rolling join on nearest Visit with score A
firstB[, c("Score_A", "Time_A") := scoreA[firstB, .(Score_A, Time), on = .(Id, VISIT), roll = "nearest"]]
# select wanted columns; Time_A and Time_B are kept so you can check that A and B are within 1 year of each other
firstB[, .(Id, Time_A, Score_A, Time_B = Time, Score_B)]
output"
Id Time_A Score_A Time_B Score_B
1: 1 4.0 7 4.0 100
2: 2 3.1 3 4.1 80
3: 3 3.2 11 3.9 75
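If you also want to enforce the +/- 1 year requirement from the question, one option (assuming you would rather drop an out-of-window match than keep it) is to filter the joined table on the time difference:

# keep only rows where the matched Score_A is within +/- 1 year of the first Score_B
firstB[abs(Time_A - Time) <= 1, .(Id, Time_A, Score_A, Time_B = Time, Score_B)]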
old answer
library(data.table)
# set to data.table format
setDT(mydata)
# get rows with first scoreB by Id
firstB <- mydata[!is.na(Score_B), .SD[1], by = .(Id)]
# all score_A values
scoreA <- mydata[!is.na(Score_A), ]
# rolling join: if Score_A is missing at that visit, take it from the next visit (at most 1 visit later)
firstB[, Score_A := scoreA[firstB, Score_A, on = .(Id, VISIT), roll = -1]]
# select wanted columns
firstB[, .(Id, Score_A, Score_B)]
# Id Score_A Score_B
# 1: 1 7 100
# 2: 2 13 80
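As an aside: the update rolls on VISIT, so "nearest" means nearest visit number, not nearest in time; on the example data that is why Id 2 ends up with Score_A = 3 (Time 3.1) rather than the 13 (Time 5.0) implied by desired_output. If "nearest" should be measured in years, the same rolling-join idea works with Time as the last join key. A sketch that mirrors the update (firstB2/scoreA2 are just fresh copies so the objects above are not overwritten):

library(data.table)
setDT(mydata)
# first non-missing Score_B per Id
firstB2 <- mydata[!is.na(Score_B), .SD[1], by = .(Id)]
# all Score_A rows, with an extra copy of Time so scoreA2's own time survives the join
scoreA2 <- mydata[!is.na(Score_A), .(Id, Time, Time_A = Time, Score_A)]
# rolling join on the nearest Time that has a Score_A
firstB2[, c("Score_A", "Time_A") := scoreA2[firstB2, .(Score_A, Time_A), on = .(Id, Time), roll = "nearest"]]
# apply the +/- 1 year window and select the wanted columns
firstB2[abs(Time_A - Time) <= 1, .(Id, Time_A, Score_A, Time_B = Time, Score_B)]

On this example data that should line up with the desired_output scores (Score_A of 7, 13 and 11).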