
R - Identifying only strings ending with A and B in a column


I have a column in a data frame in R that contains sample names. Some names are identical except for a trailing A or B, and some samples appear more than once, like this:

df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B", "S_038A", "S_040_B", "S_026B", "S_38A"))

What I am trying to do is isolate the sample names that occur with both an A and a B at the end, and exclude the names that occur with only an A or only a B.

The end result of what I'm looking for would look like this: "S_026" and "S_028" as these are the only ones that have A and B at the end.

All I seem to find is how to remove duplicates, and removing duplicates would only give me "S_026B" and "S_38A" in this case.

Alternatively, I have tried stripping the A's and B's at the end and then counting how many times each of the resulting names occurs (checking for counts > 2), but again, this does not give me the desired results.

Any suggestions?


Solution

  • We could group by the sample name without its last character (using substr), extract the last character of each name with substring, and keep only the groups whose last characters include both 'A' and 'B'

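    library(dplyr)

    df %>%
       # group by the sample name without its last character
       group_by(grp = substr(Samples, 1, nchar(Samples) - 1)) %>%
       # keep groups whose last characters include both "A" and "B"
       filter(all(c("A", "B") %in% substring(Samples, nchar(Samples)))) %>%
       ungroup() %>%
       # drop the helper grouping column
       select(-grp)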
    

    -output

    # A tibble: 5 x 1
      Samples
      <chr>  
    1 S_026A 
    2 S_026B 
    3 S_028A 
    4 S_028B 
    5 S_026B
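
The output above keeps every matching row, including the repeated "S_026B". If only the base names from the question ("S_026" and "S_028") are wanted, the same idea can be sketched in base R. This is a minimal sketch, not the only way to do it; the helper names base, suffix, has_both and paired are made up for illustration, and it assumes the A/B marker is always the very last character of the name.

```r
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B",
                             "S_038A", "S_040_B", "S_026B", "S_38A"))

# Work on a character vector (the column may be a factor in older R versions)
samples <- as.character(df$Samples)
n <- nchar(samples)

# Split each name into its base (all but the last character) and its suffix
base   <- substr(samples, 1, n - 1)   # hypothetical helper name
suffix <- substr(samples, n, n)       # hypothetical helper name

# For each base name, check whether both "A" and "B" occur among its suffixes
has_both <- tapply(suffix, base, function(s) all(c("A", "B") %in% s))

# Base names for which the check is TRUE
paired <- names(has_both)[as.vector(has_both)]
paired
# [1] "S_026" "S_028"
```

Because the check uses %in% rather than a count, repeated samples such as the second "S_026B" do not affect the result.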