
Ruby regular expression for sequence with specified start and end


I have this string:

mRNA = "gcgagcgagcaugacgcauguactugacaugguuuaaggccgauuagugaaugugcagacgcgcauaguggcgagcuaaaaacat"

I want to upcase certain subsequences of this sequence. Each subsequence should start with aug and end with uaa, uag, or uga. When I use the following regular expression with gsub!:

mRNA.gsub!(/(aug.*uaa)|(aug.*uag)|(aug.*uga)/, &:upcase)

it results in

gcgagcgagcAUGACGCAUGUACTUGACAUGGUUUAAGGCCGAUUAGUGAAUGUGCAGACGCGCAUAGUGGCGAGCUAAaaacat

I don’t understand why it upcases one whole chunk instead of giving me two subsequences like this:

gcgagcgagcAUGACGCAUGUACTUGACAUGGUUUAAggccgauuagugaAUGUGCAGACGCGCAUAGuggcgagcuaaaaacat

What regular expression can I use to achieve this?


Solution

  • The * quantifier is "greedy": .* grabs as many characters as it can while still allowing the overall pattern to match. That is why your match runs from the first aug all the way to the last uaa in the string.

    To match as few characters as possible instead, use the "non-greedy" (lazy) form, .*?.

    Modifying your original regex:

    mRNA.gsub!(/(aug.*?uaa)|(aug.*?uag)|(aug.*?uga)/, &:upcase)
    

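    A shorter regex, using @stribizhev's suggestion, puts the three stop codons in one non-capturing group. It is not strictly equivalent to the one above: it ends each match at the first stop codon of any kind, whereas the three-branch version tries the uaa branch first and so can run past an earlier uag or uga.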

    mRNA.gsub!(/aug.*?(?:uaa|uag|uga)/, &:upcase)
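As a sanity check, the sketch below runs a greedy, a lazy, and a codon-aligned pattern over the sequence from the question. The codon-aligned pattern is an assumption on my part: the question never mentions reading frames, but the expected output it shows only appears if the stop codon is read in frame with aug, that is, a multiple of three characters after it. The constant names are made up for this example.

```ruby
# Sequence copied from the question.
MRNA = "gcgagcgagcaugacgcauguactugacaugguuuaaggccgauuagugaaugugcagacgcgcauaguggcgagcuaaaaacat"

# Greedy: .* runs to the LAST stop codon it can reach.
GREEDY = /aug.*(?:uaa|uag|uga)/

# Lazy: .*? stops at the FIRST stop triplet, in any reading frame.
LAZY = /aug.*?(?:uaa|uag|uga)/

# Assumed variant: consume whole codons (3 characters at a time), lazily,
# so the stop codon has to be in frame with the aug that opened the match.
IN_FRAME = /aug(?:.{3})*?(?:uaa|uag|uga)/

# gsub (without the bang) returns a new string and leaves MRNA untouched.
greedy   = MRNA.gsub(GREEDY, &:upcase)
lazy     = MRNA.gsub(LAZY, &:upcase)
in_frame = MRNA.gsub(IN_FRAME, &:upcase)

puts greedy
# gcgagcgagcAUGACGCAUGUACTUGACAUGGUUUAAGGCCGAUUAGUGAAUGUGCAGACGCGCAUAGUGGCGAGCUAAaaacat
puts lazy
# gcgagcgagcAUGACGCAUGUACTUGAcAUGGUUUAAggccgauuagugaAUGUGCAGACGCGCAUAGuggcgagcuaaaaacat
puts in_frame
# gcgagcgagcAUGACGCAUGUACTUGACAUGGUUUAAggccgauuagugaAUGUGCAGACGCGCAUAGuggcgagcuaaaaacat
```

The lazy result differs from the in-frame one by a single character: the lazy pattern ends its first match at the out-of-frame uga in "cuuga" and then starts a new match at the following aug, leaving one lowercase c between them. If reading frames don't matter for your data, the plain lazy pattern is the simpler choice.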