I've got a string:
s = ".,-2gg,,,-2gg,-2gg,,,-2gg,,,,,,,,t,-2gg,,,,,,-2gg,t,,-1gtt,,,,,,,,,-1gt,-3ggg"
and a regular expression I'm using:
import re
delre = re.compile('-[0-9]+[ACGTNacgtn]+') #this is almost correct
print (delre.findall(s))
This returns:
['-2gg', '-2gg', '-2gg', '-2gg', '-2gg', '-2gg', '-1gtt', '-1gt', '-3ggg']
But -1gtt and -1gt are not desired matches. The integer in this case defines how many subsequent characters to match, so the desired output for those two matches would be -1g and -1g, respectively.
Is there a way to grab the integer after the dash and dynamically define the regex so that it matches that many and only that many subsequent characters?
Python's re module has no way to use a captured number as a repetition count, so you can't do this with the regex pattern alone. You can, however, use capture groups to separate the integer and character portions of the match, and then trim the character portion to the appropriate length.
import re
# surround [0-9]+ and [ACGTNacgtn]+ in parentheses to create two capture groups
delre = re.compile('-([0-9]+)([ACGTNacgtn]+)')
s = ".,-2gg,,,-2gg,-2gg,,,-2gg,,,,,,,,t,-2gg,,,,,,-2gg,t,,-1gtt,,,,,,,,,-1gt,-3ggg"
# each match should be a tuple of (number, letter(s)), e.g. ('1', 'gtt') or ('2', 'gg')
for number, bases in delre.findall(s):
    # print the number, then use slicing to truncate the string portion
    print(f'-{number}{bases[:int(number)]}')
This prints:
-2gg
-2gg
-2gg
-2gg
-2gg
-2gg
-1g
-1g
-3ggg
You'll more than likely want to do something other than print, but you can format the matched strings however you need!
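For instance, if you'd rather have the trimmed matches in a list, something like the following might work. This is only a sketch: the function name trimmed_deletions is made up for illustration, and the pattern is the same two-group one used above.

```python
import re

# Same two-group pattern as above: group 1 is the count, group 2 is the bases.
delre = re.compile('-([0-9]+)([ACGTNacgtn]+)')

def trimmed_deletions(s):
    """Return each match with its bases truncated to the stated count.

    Hypothetical helper, not part of any library.
    """
    return [f'-{number}{bases[:int(number)]}'
            for number, bases in delre.findall(s)]

s = ".,-2gg,,,-2gg,-2gg,,,-2gg,,,,,,,,t,-2gg,,,,,,-2gg,t,,-1gtt,,,,,,,,,-1gt,-3ggg"
result = trimmed_deletions(s)
print(result)
```

The slicing is the same as in the loop above; only the output container changes.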
NOTE: this does not reject cases where the integer is followed by fewer matching characters than it specifies, e.g. -10agcta is still a match even though it contains only 5 characters.
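If those short matches should be dropped instead, one option is to compare the length of the captured characters against the integer before keeping the match. A minimal sketch, assuming that silently skipping them is what you want (you may prefer to raise an error or log them); strict_deletions is a hypothetical name.

```python
import re

delre = re.compile('-([0-9]+)([ACGTNacgtn]+)')

def strict_deletions(s):
    """Keep only matches that have at least as many bases as the count states."""
    kept = []
    for number, bases in delre.findall(s):
        n = int(number)
        # Assumption: matches with too few bases are skipped, not reported.
        if len(bases) >= n:
            kept.append(f'-{number}{bases[:n]}')
    return kept

print(strict_deletions("-10agcta,-1gtt,-3ggg"))
```

Here -10agcta is skipped because only 5 characters follow the 10, while -1gtt and -3ggg are kept and trimmed as before.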