Tags: r, regex, pcre, strsplit

Splitting a string with zero-width assertions while removing the delimiter in R


Assume we have a string:

test <- 'chr1:949920-950500_ENSG_00000187583'

Expected output:

'chr1:949920-950500'  'ENSG_00000187583'

We tried:

strsplit(test, '_(?<=ENS)', perl = TRUE)
[[1]]
[1] 'chr1:949920-950500_ENSG_00000187583'

We also want to split strings that follow this pattern:

"chr1:949920-950500_P16"
# split to 
"chr1:949920-950500"    "P16"

Solution

  • (?<=ENS) means "preceded by ENS". The position right after the _ is always preceded by _ itself, so it can never be preceded by ENS, and _(?<=ENS) can never match.

    Are you trying to split on all the _ that are followed by ENSG?

    _(?=ENSG)
    

    Read this as _ followed by ENSG.

    Are you trying to split on all the _ that aren't preceded by ENSG?

    You can use either of these:

    (?<!ENSG)_
    
    _(?<!ENSG_)
    

    (The second might be a tad more efficient, but I don't think it's worth the extra complexity.)
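
As a quick check, here is a minimal sketch that runs the patterns discussed above against both example strings from the question. The variable names (`other`, `no_split`, `by_lookahead`, `by_lookbehind`, `by_lookbehind_alt`) are only illustrative and do not come from the original post.

```r
test  <- 'chr1:949920-950500_ENSG_00000187583'
other <- 'chr1:949920-950500_P16'

# Original attempt: once "_" has been consumed, the lookbehind would need
# "ENS" to end at that same position, so the pattern never matches.
no_split <- strsplit(test, '_(?<=ENS)', perl = TRUE)[[1]]

# Lookahead: split on a "_" that is followed by "ENSG".
by_lookahead <- strsplit(test, '_(?=ENSG)', perl = TRUE)[[1]]

# Negative lookbehind: split on a "_" that is not preceded by "ENSG".
# This form also handles the second example, which contains no "ENSG".
by_lookbehind <- strsplit(c(test, other), '(?<!ENSG)_', perl = TRUE)

# Equivalent variant with the lookbehind placed after the "_".
by_lookbehind_alt <- strsplit(c(test, other), '_(?<!ENSG_)', perl = TRUE)

print(no_split)
print(by_lookahead)
print(by_lookbehind)
```

Note that the lookahead version only splits the first string; for inputs such as `other`, which have no `ENSG` after the underscore, the negative lookbehind is the pattern that covers both cases.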