rdplyr

flagging rows with different non-NA values within the same grouping variable


I have a dataframe that contains multiple observations for some individuals. Some observations are missing data, but it isn't always the same data, so I am trying to merge the observations to get one row for each individual that is as complete as possible.

Here is a toy dataframe that resembles what I am working with:

df<-data.frame(Ind=c("C","C","C","B","B"),
          V1 = c("a",NA,"b","a",NA),
          V2 = c("b","b",NA,"a","a"),
          V3 = c(NA,"a","a",NA,"b"),
          obs.id = c(1,2,3,4,5))
> df
      Ind   V1   V2   V3 obs.id
    1     C    a    b <NA>      1
    2     C <NA>    b    a      2
    3     C    b <NA>    a      3
    4     B    a    a <NA>      4
    5     B <NA>    a    b      5

I have figured out a how to merge the data using summarise, like so:

> df%>%select_if(function(x) !all(is.na(x)))%>%
add_count(Ind)%>%filter(n>1)%>%
group_by(Ind)%>%
summarise(across(starts_with("V"),~max(., na.rm = TRUE)))

# A tibble: 2 × 4
  Ind   V1    V2    V3   
  <chr> <chr> <chr> <chr>
1 B     a     a     b    
2 C     b     b     a 

But you will notice that in df Ind C has two different values for V1, but the summarise command is just returning the "maximum" value, even though that isn't really a relevant descriptor in my data. This is bad and I need to know which individuals have non-identical variables in the different observations when one of them isn't NA. Tidyverse solutions are preferable, but I'm not picky about what the result looks like; it could be a vector of individuals with this problem, or a new column in df, whatever. I just need to know which individuals have this issue and can't figure out how to test this.


Solution

  • If you just want to flag if any variable has more than one element

    library(dplyr)
    
    df %>% 
      mutate(flag = any(across(starts_with("V"), ~ 
        length(na.omit(unique(.x))) > 1)), .by = Ind)
      Ind   V1   V2   V3 obs.id  flag
    1   C    a    b <NA>      1  TRUE
    2   C <NA>    b    a      2  TRUE
    3   C    b <NA>    a      3  TRUE
    4   B    a    a <NA>      4 FALSE
    5   B <NA>    a    b      5 FALSE
    

    Or, if all variables should be flagged

    library(dplyr)
    
    df %>% 
      mutate(across(starts_with("V"), ~ 
        length(na.omit(unique(.x))) > 1, .names="{.col}_flag"), .by = Ind)
      Ind   V1   V2   V3 obs.id V1_flag V2_flag V3_flag
    1   C    a    b <NA>      1    TRUE   FALSE   FALSE
    2   C <NA>    b    a      2    TRUE   FALSE   FALSE
    3   C    b <NA>    a      3    TRUE   FALSE   FALSE
    4   B    a    a <NA>      4   FALSE   FALSE   FALSE
    5   B <NA>    a    b      5   FALSE   FALSE   FALSE
    

    Getting all elements

    library(dplyr)
    
    df %>% 
      reframe(across(starts_with("V"), ~ 
        toString(na.omit(unique(.x)))), .by = Ind)
      Ind   V1 V2 V3
    1   C a, b  b  a
    2   B    a  a  b