r, string, bioinformatics, rna-seq

How to create a regex expression to get a substring between 2 pipes


I have a dataset that I'm trying to work with where I need to get the text between two pipe delimiters. The length of the text is variable so I can't use length to get it. This is the string:

ENST00000000233.10|ENSG00000004059.11|OTTHUMG000

I want to get the text between the first and second pipes, which here is ENSG00000004059.11. I've tried several regular expressions, but I can't figure out the correct syntax. What regex should I use?


Solution

  • Here is a regex.

    x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
    sub("^[^\\|]*\\|([^\\|]+)\\|.*$", "\\1", x)
    #> [1] "ENSG00000004059.11"
    

    Created on 2022-05-03 by the reprex package (v2.0.1)

    Explanation:

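    The pattern reads left to right:

      • ^ anchors the match at the start of the string;
      • [^\\|]*\\| matches everything up to and including the first pipe;
      • ([^\\|]+) captures one or more non-pipe characters, that is, the text between the first and second pipes;
      • \\|.*$ matches the second pipe and everything after it, up to the end of the string.

    Since the pattern matches the whole string, replacing the match with the 1st (and only) group, "\\1", removes everything else.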
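
As a possible alternative to the regex above, the string can be split on the literal pipe and the second field picked out. This is only a sketch: the second input string and the helper name `second_field` are made up for illustration, and returning NA for strings with fewer than two fields is an assumption about the desired behaviour.

```r
# Example input: the string from the question plus a hypothetical
# string with no pipes, to show the fallback.
x <- c("ENST00000000233.10|ENSG00000004059.11|OTTHUMG000",
       "no_pipes_here")

# fixed = TRUE treats "|" as a literal character, so no escaping is needed.
fields <- strsplit(x, "|", fixed = TRUE)

# Hypothetical helper: return the second field, or NA if there is none.
second_field <- function(f) {
  if (length(f) >= 2) f[2] else NA_character_
}

second <- vapply(fields, second_field, character(1))
# second[1] is "ENSG00000004059.11"; second[2] is NA
```

One difference worth noting: sub() returns the input unchanged when the pattern does not match, whereas this version makes the missing case explicit as NA.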