rdataframesubsetminmax

Removing people from a dataframe in R if their results don't span a specific period of time


Dataset of blood results:

        id result date    
        A1 80     01/01/2006
        A1 70     02/10/2006
        A1 61     01/01/2007
        A1 30     01/01/2008
        A1 28     03/06/2008
        B2 40     01/01/2006
        B2 30     01/10/2006
        B2 25     01/01/2015
        B2 10     01/01/2020
        G3 28     01/01/2009
        G3 27     01/01/2014
        G3 25     01/01/2013
        G3 24     01/01/2011
        G3 22     01/01/2019
        U7 20     01/01/2005
        U7 19     01/01/2006
        U7 18     01/04/2006
        U7 18     01/08/2006

I would like to only keep those individuals who have blood results spanning at least a three year period.

I can convert their dates to just years along the line of:

df %>% 
  # Create a column holding year for each ID
  mutate(date = dmy(date)) %>% 
  mutate(year = year(date)) %>% 
  # group by ID
  group_by(ID, year) %>% 
  # find max diff
  summarise(max_diff = max(year) - min(year))

How would I then continue the pipe to remove those with a max diff <3. The desired output from the above example would be:

    id result date    
    B2 40     2006
    B2 30     2006
    B2 25     2015
    B2 10     2020
    G3 28     2009
    G3 27     2014
    G3 25     2013
    G3 24     2011
    G3 22     2019

I would then pipe these people into the answer for this question: Predicting when an output might happen in time in R

Many thanks


Solution

  • Cutting down to year only is not a great move because you'll under/overestimate the range depending on whether people had their first/last blood test early or late in the year (case in point, A1 has blood tests in 3 separate calendar years, but less than 3 years apart overall).

    To actually work out the range between first and last blood test, you can use diff(range()). In this case, I'm using tapply.

    df <- structure(list(id = c("A1", "A1", "A1", "A1", "A1", "B2", "B2", 
    "B2", "B2", "G3", "G3", "G3", "G3", "G3", "U7", "U7", "U7", "U7"
    ), result = c(80L, 70L, 61L, 30L, 28L, 40L, 30L, 25L, 10L, 28L, 
    27L, 25L, 24L, 22L, 20L, 19L, 18L, 18L), date = structure(c(13149, 
    13423, 13514, 13879, 14033, 13149, 13422, 16436, 18262, 14245, 
    16071, 15706, 14975, 17897, 12784, 13149, 13239, 13361), class = "Date")), 
    row.names = c(NA, -18L), class = "data.frame")
    
    ranges <- tapply(df$date, df$id, function(dates) diff(range(dates))/365.25)
    ranges
    
           A1        B2        G3        U7 
     2.420260 13.998631  9.998631  1.579740 
    
    keep <- names(ranges)[ranges>=3]
    keep
    [1] "B2" "G3"
    

    You can flip this to use pipes if that is your preferred syntax.