python regex bioinformatics protein-database

What does this Python regular expression mean?


In my previous question, I asked how to get the contents between square brackets using a regular expression. The input file looks like this:

#start gene g1
dog1
dog2
dog3
#protein sequence = [DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD]
#end gene g1
###
#start gene g2
cat1
cat2
cat3
#protein sequence = [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]
#end gene g2
###
#start gene g3
pig1
pig2
pig3
...

I want to get the contents between the square brackets and write them to a new file named 50267.fa, like this:

>g1_50267
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>g2_50267
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC 
CCCCCCCCCCCCCCCCCCCC
...

and I got an answer like this:

import re

with open("50267.gff", "r") as ff:
    matches = re.findall(r'\[([^\]]+)', ff.read())
    matches = ['>g' + str(ind+1) + "_50267\n" + x.replace('\n#', ' ') for ind, x in enumerate(matches)]
#print(matches)
with open('50267.fa', 'w') as fa:
    fa.write("\n".join(matches))

When I tried that code, it worked well, but I don't understand what the following pieces of code mean:

r'\[([^\]]+)'
x in enumerate(matches)

Solution

  • Let's look at the two things you're confused by.

    First: r'\[([^\]]+)'. This is a raw string literal (r'...'). In this context, "raw" just means that backslashes are not interpreted by Python's compiler as the start of an escape sequence; they are kept as actual backslash characters. That matters because the regular expression language also uses backslashes in its own escape sequences, and we want them to reach the regex engine intact.
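As a small illustration of the raw-string point (a sketch only; the variable names are made up for this example), the raw literal and a doubled-backslash ordinary literal produce the same pattern text:

```python
# The raw literal keeps each backslash as-is.
raw_pattern = r'\[([^\]]+)'

# Without the r prefix, every backslash has to be doubled to survive
# Python's own escape processing.
escaped_pattern = '\\[([^\\]]+)'

print(raw_pattern == escaped_pattern)  # True

# r'\n' is two characters (backslash, n); '\n' is a single newline.
print(len(r'\n'), len('\n'))  # 2 1
```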

    The string \[([^\]]+) is a regex pattern that matches a literal [ character (escaped with a backslash, since an unescaped bracket has a special meaning that we'll see in a moment), followed by a capturing group (...) that contains one or more (+) characters from a specific "character class" [...] (that's the other meaning of square brackets!). This character class is negated by the leading ^, so it matches any character that is not ], a closing bracket. That includes newlines, which is why a sequence that continues onto a second line is still captured whole. (The backslash escaping the closing bracket is not actually needed: [^] on its own is not a valid character class, so a ] placed directly after the ^ is treated as a literal, and [^]] works just as well as [^\]]. Including the backslash is harmless, though.)

    So the pattern matches input that starts with an opening square bracket, and then captures one or more characters that follow as long as they're not a closing bracket.
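To see the pattern in action, here is a minimal sketch run against a short made-up string shaped like the file in the question (the `sample` text is an assumption for illustration, not the real 50267.gff):

```python
import re

# Hypothetical sample text, shaped like the .gff excerpt in the question.
sample = (
    "#protein sequence = [DDDD]\n"
    "#end gene g1\n"
    "#protein sequence = [CCCC\n"
    "#CC]\n"
    "#end gene g2\n"
)

pattern = r'\[([^\]]+)'

# With one capturing group, findall returns only the captured text,
# so the leading "[" is not part of the results.
captured = re.findall(pattern, sample)
print(captured)  # ['DDDD', 'CCCC\n#CC']

# The unescaped form of the character class gives the same result.
print(re.findall(r'\[([^]]+)', sample) == captured)  # True
```

Note that the second result still contains the "\n#" from the line break in the input, which is what the x.replace('\n#', ' ') call in the original code cleans up.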

    The other thing you're confused by is for ind, x in enumerate(matches) (I've cut out a slightly larger bit of the code than you did). The enumerate function takes an iterable argument and returns an iterator that yields (index, item) two-tuples. The first value of each tuple is an integer, starting (by default) at zero and counting up one by one. The second value is the next item from the iterable given to enumerate.

    The for loop unpacks the values from the tuples into variables named ind and x, which it uses elsewhere to build the strings for each record that will go into the output. The index number ind is used to generate the g1, g2 names (as ind+1, since counting starts at zero) rather than parsing them from the file. As long as the genes in each file are numbered sequentially starting from g1 and each has exactly one bracketed sequence, that should be fine.
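The unpacking can be sketched on its own like this, using a short made-up `matches` list in place of the real regex results (the list contents are an assumption for illustration):

```python
# Hypothetical captured sequences standing in for the real `matches` list.
matches = ['DDDD', 'CCCC\n#CC']

# enumerate yields (index, item) pairs; the for loop unpacks each pair.
for ind, x in enumerate(matches):
    print(ind, repr(x))
# 0 'DDDD'
# 1 'CCCC\n#CC'

# The same comprehension as in the original code: the header uses ind + 1,
# and the "\n#" left over from a wrapped sequence becomes a space.
records = ['>g' + str(ind + 1) + '_50267\n' + x.replace('\n#', ' ')
           for ind, x in enumerate(matches)]
print(records)  # ['>g1_50267\nDDDD', '>g2_50267\nCCCC CC']
```

If you prefer, enumerate(matches, start=1) makes the count begin at one, so the + 1 is no longer needed.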