r, regex

How to extract a portion between the first and second occurrence of a character using regular expressions in R?


I have a dataframe that looks like this:

[screenshot of the data frame, not reproduced]

I need to extract the UniProt IDs that sit between the first and second occurrence of | in rows 5 to 14 only.

Expected outcome:

A0A3Q8IUE6
A4I9M8
E9BQL4
Q4Q3E9
A0A640KX53
E9B4M7
.
.

Solution

  • We can use strsplit here together with sapply, splitting each value on the literal | and keeping the second piece:

    # Split each value on a literal "|" and keep the second element
    df$output <- sapply(df$x, function(x) strsplit(x, "\\|")[[1]][2])
    df
    
                      x     output
    1 A|A0A3Q8IUE6|blah A0A3Q8IUE6
    2      B|A4I9M8|meh     A4I9M8
    

    Data:

    x <- "A|A0A3Q8IUE6|blah"
    y <- "B|A4I9M8|meh"
    df <- data.frame(x=c(x,y))
    

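    Note: If certain x values in the data frame are not in pipe-delimited format, and therefore have no second element, then output will be NA for those rows, because indexing past the end of the split vector returns NA. If you want to keep only values that start with a particular prefix (for example the tr| or sp| that begins a UniProt header), we could use grepl to detect them, e.g.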

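    # Keep the second field only where x starts with "tr|" or "sp|";
    # every other row gets a character NA
    df$output <- ifelse(grepl("^(?:tr|sp)\\|", df$x),
                        sapply(df$x, function(x) strsplit(x, "\\|")[[1]][2]),
                        NA_character_)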
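
Since the question asks for a regular expression, here is a minimal sketch of a regex-only variant using sub, and not necessarily the approach intended above. The helper name extract_between_pipes, the third sample value, and the row range 1:2 are assumptions made for illustration; on the real data the range would be 5:14.

```r
# Sample data from the answer above, plus one hypothetical value without pipes
df <- data.frame(
  x = c("A|A0A3Q8IUE6|blah", "B|A4I9M8|meh", "no_pipes_here"),
  stringsAsFactors = FALSE
)

# Anything up to the first "|", then capture everything up to the second "|"
pattern <- "^[^|]*\\|([^|]*)\\|.*$"

# Hypothetical helper: returns the captured field, or NA when there is no match
extract_between_pipes <- function(x) {
  # sub() returns its input unchanged when the pattern does not match,
  # so non-matching values are set to NA explicitly
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

df$output <- extract_between_pipes(df$x)

# Restrict the extraction to a row range (use 5:14 on the real data)
rows <- 1:2
ids <- extract_between_pipes(df$x[rows])
```

The grepl guard matters here because sub alone would hand back the original string for any value that lacks two pipes.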