rdplyr

Finding the first occurrence of a value in a specific column in a dataframe


I have data that contains multiple replications of program output. For each replication, I need to sort by Name, then Time, then locate when the value in the Stage column changes. The row where the Stage value changes should then be written to a new dataframe. The new dataframe will contain all rows where that Stage variable changes value.

If my data looks like this:

Rep Time Name Stage
1 33.075 Tom none
1 85.972 Tom incubation
1 86.139 Tom incubation
1 29.120 Sam none
1 33.099 Sam incubation
1 90.568 Sam symptomatic
2 33.075 Tom none
2 85.972 Tom incubation
2 29.120 Tom none
2 33.099 Sam none
2 90.568 Sam none
3 32.515 Tom none
3 90.863 Joe symptomatic
3 86.139 Joe incubation
3 85.972 Tom incubation
3 29.120 Joe none

My resulting dataframe needs to look like this:

Rep Time Name Stage
1 85.972 Tom incubation
1 33.099 Sam incubation
1 90.568 Sam symptomatic
2 85.972 Tom incubation
3 85.972 Tom incubation
3 86.139 Joe incubation
3 90.863 Joe symptomatic

It doesn't seem like it would be that complicated, but I have been playing around with it all day and can't get it working properly.


Solution

  • In base R, we can try

    Y = 
      X[order(X$Rep, -xtfrm(X$Name), X$Time), ] |>
      transform(i = ave(Stage, Rep, Name, FUN=\(x) x!=c(NA, head(x, -1)))) |>
      subset(as.logical(i), -i)
    
    > Y
      Rep   Time Name       Stage
    2    1 85.972  Tom  incubation
    5    1 33.099  Sam  incubation
    6    1 90.568  Sam symptomatic
    8    2 85.972  Tom  incubation
    15   3 85.972  Tom  incubation
    14   3 86.139  Joe  incubation
    13   3 90.863  Joe symptomatic
    

    Data made reproducible

    X = structure(list(Rep = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 
    3, 3, 3), Time = c(33.075, 85.972, 86.139, 29.12, 33.099, 90.568, 
    33.075, 85.972, 29.12, 33.099, 90.568, 32.515, 90.863, 86.139, 
    85.972, 29.12), Name = c("Tom", "Tom", "Tom", "Sam", "Sam", "Sam", 
    "Tom", "Tom", "Tom", "Sam", "Sam", "Tom", "Joe", "Joe", "Tom", 
    "Joe"), Stage = c("none", "incubation", "incubation", "none", 
    "incubation", "symptomatic", "none", "incubation", "none", "none", 
    "none", "none", "symptomatic", "incubation", "incubation", "none"
    )), class = "data.frame", row.names = c(NA, -16L))