I'm trying to use regular expressions to extract a digit as well as a number of characters equal to that digit from a string. This is for analyzing a pileup summary output from samtools mpileup (see here). I'm doing this in Python.
As an example, let's say I have the following string:
.....+3AAAT.....
I am trying to extract the +3AAA from the string, leaving us with:
.....T.....
Note that the T remains, because I only want to extract 3 characters (the string indicates that 3 should be extracted).
I could do the following:
re.sub(r"\+[0-9]+[ACGTNacgtn]+", "", ".....+3AAAT.....")
But this would cut out the T as well, leaving us with:
..........
Is there a way to use the information in a string to adjust the pattern in a regular expression? I could work around this without regular expressions, but if regular expressions can do it, I'd rather use them.
You can pass a lambda to re.sub():
import re

def replace(string):
    replaced = re.sub(
        r'\+([0-9]+)([ACGTNacgtn]+)',
        # group(1) = '3', group(2) = 'AAAT'
        lambda match: match.group(2)[int(match.group(1)):],
        string
    )
    return replaced
Try it:
string = '.....+3AAAT.....'
print(replace(string)) # '.....T.....'
string = '.....+10AAACCCGGGGTN.....'
print(replace(string)) # '.....TN.....'
string = '.....+0AN.....'
print(replace(string)) # '.....AN.....'
string = '.....+5CAGN.....'
print(replace(string)) # '..........'
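If you also need the sequences that were removed, the same callback idea can be written with a named function instead of a lambda. This is only a minimal sketch built on the pattern above: the names `strip_insertions`, `callback`, and `removed` are made up for illustration, and it assumes the count after `+` is followed by at least that many base letters.

```python
import re

# Same pattern as above: '+', a decimal count, then a run of base letters.
INSERTION = re.compile(r'\+([0-9]+)([ACGTNacgtn]+)')

def strip_insertions(pileup):
    """Return (cleaned string, list of removed insertion sequences)."""
    removed = []

    def callback(match):
        count = int(match.group(1))
        bases = match.group(2)
        removed.append(bases[:count])  # the bases the count refers to
        return bases[count:]           # anything after them is kept

    cleaned = INSERTION.sub(callback, pileup)
    return cleaned, removed

print(strip_insertions('.....+3AAAT.....'))  # ('.....T.....', ['AAA'])
```

A named callback is a little longer than the lambda, but it leaves room for side effects such as collecting the removed bases.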