I have this string:
mRNA = "gcgagcgagcaugacgcauguactugacaugguuuaaggccgauuagugaaugugcagacgcgcauaguggcgagcuaaaaacat"
I want to upcase certain subsequences of this sequence. A subsequence should start with aug and should end with either uaa, uag or uga.
When I use the following regular expression in combination with gsub!:
mRNA.gsub!(/(aug.*uaa)|(aug.*uag)|(aug.*uga)/, &:upcase)
it results in
gcgagcgagcAUGACGCAUGUACTUGACAUGGUUUAAGGCCGAUUAGUGAAUGUGCAGACGCGCAUAGUGGCGAGCUAAaaacat
I don’t understand why it upcases one whole chunk instead of giving me two subsequences like this:
gcgagcgagcAUGACGCAUGUACTUGACAUGGUUUAAggccgauuagugaAUGUGCAGACGCGCAUAGuggcgagcuaaaaacat
What regular expression can I use to achieve this?
The .* quantifier is "greedy": it matches as many characters as it can while still letting the rest of the pattern match. That is why your match runs from the first aug all the way to the last uaa.
To match as few characters as possible instead, use the "non-greedy" (lazy) form, .*?.
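As a minimal sketch of the difference, here is the same pattern in both forms applied to a short made-up sequence (the sample string is an assumption chosen for illustration, not your data):

```ruby
# Hypothetical sample: two "aug...uaa" stretches separated by "cc".
sample = "ccaugggcuaaccaugcccuaagg"

# Greedy: .* runs to the LAST "uaa", so both stretches merge into one match.
greedy = sample.gsub(/aug.*uaa/, &:upcase)

# Lazy: .*? stops at the FIRST "uaa" after each "aug", giving two matches.
lazy = sample.gsub(/aug.*?uaa/, &:upcase)

puts greedy  # ccAUGGGCUAACCAUGCCCUAAgg
puts lazy    # ccAUGGGCUAAccAUGCCCUAAgg
```

Note the lowercase cc between the two stretches in the lazy result.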
Modifying your original regex:
mRNA.gsub!(/(aug.*?uaa)|(aug.*?uag)|(aug.*?uga)/, &:upcase)
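A shorter regex folds the three stop codons into one non-capturing group, as @stribizhev suggested. Note that it is not strictly equivalent to the one above: it ends each match at the nearest stop codon of any kind, whereas the three-branch version tries uaa first and only falls back to uag or uga when no uaa follows.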
mRNA.gsub!(/aug.*?(?:uaa|uag|uga)/, &:upcase)
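To see how the two regexes behave on your string, here is a small comparison. It is only a sketch: it uses the non-destructive gsub so the input stays unchanged, and the variable names by_branch and nearest are mine.

```ruby
mRNA = "gcgagcgagcaugacgcauguactugacaugguuuaaggccgauuagugaaugugcagacgcgcauaguggcgagcuaaaaacat"

# Three branches: at each "aug", the uaa branch is tried first and wins
# whenever any "uaa" follows, even if a uag or uga comes earlier.
by_branch = mRNA.gsub(/(aug.*?uaa)|(aug.*?uag)|(aug.*?uga)/, &:upcase)

# One group: each match ends at the nearest stop codon, whichever it is.
nearest = mRNA.gsub(/aug.*?(?:uaa|uag|uga)/, &:upcase)

puts by_branch
# gcgagcgagcAUGACGCAUGUACTUGACAUGGUUUAAggccgauuagugaAUGUGCAGACGCGCAUAGUGGCGAGCUAAaaacat
puts nearest
# gcgagcgagcAUGACGCAUGUACTUGAcAUGGUUUAAggccgauuagugaAUGUGCAGACGCGCAUAGuggcgagcuaaaaacat
```

Neither line is identical to the output in the question: the first runs its second match on to the later uaa, and the second splits the first stretch at the early uga. Both match on raw text only. If the stop codon has to be in the same reading frame as aug, the pattern would need to consume the sequence three characters at a time.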