I have made certain rules that I need to search for in a file. These rules are essentially phrases with an unknown number of words within. For example,
mutant...causes(...)GS
Here, this is a phrase that I want to search for in my file. The ... means a few words should be here (i.e. in this gap), and (...) means there may or may not be words in this gap. GS here is a fixed string variable that I know.
Basically I made these rules by going through many such files and they tell me that a particular file does what I am looking for.
The problem is that the gap can contain any (small) number of words. A gap can even span a line break. Hence, I cannot use exact string matching.
Some example texts -
!Series_summary "To better understand how the expression of a *mutant gene that causes ALS* can perturb the normal phenotype of astrocytes, and to identify genes that may
Here the GS is ALS (defined) and the starred text should be found as a positive match for the rule mutant...causes(...)GS
!Series_overall_design "The analysis includes 9 samples of genomic DNA from
isolated splenic CD11c+ dendritic cells (>95% pure) per group. The two groups are neonates born to mothers with *induced allergy to ovalbumin*, and normal control neonates. All neonates are genetically and environmentally identical, and allergen-naive."
Here the GS is ovalbumin (defined) and the starred text should be found as a positive match for the rule induced...to GS
I am a beginner at programming in Python, so any help would be great!
The following should get you started. It reads in your file and displays all possible matches using a Python regular expression, which will help you to check that it is matching all of the correct text:
import re

with open('input.txt', 'r') as f_input:
    data = f_input.read()

# re.S lets '.' match newlines, so a gap may span a line break
print(re.findall(r'(mutant\s.*?\scauses.*?GS)', data, re.S))
To then test only for the presence of a match, change findall to search:
import re

with open('input.txt', 'r') as f_input:
    data = f_input.read()

# search stops at the first match and returns None if there is none
if re.search(r'(mutant\s.*?\scauses.*?GS)', data, re.S):
    print('found')
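Note that the patterns above contain the literal letters GS. In your case GS stands for a string you already know (ALS in your first example), so it probably needs to be substituted into the pattern. A minimal sketch of one way to do that, where `build_pattern` is just a hypothetical helper name and the sample text is taken from your question:

```python
import re

def build_pattern(gs):
    # Hypothetical helper: re.escape protects any characters in gs
    # that would otherwise be special in a regular expression (e.g. '+').
    return re.compile(r'mutant\s.*?\scauses.*?' + re.escape(gs), re.S)

text = ('"To better understand how the expression of a mutant gene that '
        'causes ALS can perturb the normal phenotype of astrocytes"')

match = build_pattern('ALS').search(text)
print(match.group(0) if match else 'no match')
```

Using `re.escape` matters for values such as CD11c+, where the `+` would otherwise be read as a regex quantifier.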
To carry this out on many such files, you could adapt it as follows:
import re
import glob

# '*.*' matches every file in the current folder whose name contains a dot
for filename in glob.glob('*.*'):
    with open(filename, 'r') as f_input:
        data = f_input.read()

    if re.search(r'mutant\s.*?\scauses.*?GS', data, re.S):
        print("'{}' matches".format(filename))
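One caveat with `.*?` combined with `re.S` is that a gap can stretch over any amount of text, not just a few words. If you have many rules, it may be easier to translate your own notation into a regular expression with a bounded gap. The following is only a sketch under some assumptions: `rule_to_regex` and `MAX_GAP_WORDS` are names I made up, "a few words" is taken to mean at most five, and a word is any run of non-whitespace characters.

```python
import re

MAX_GAP_WORDS = 5  # assumption: "a few words" means at most five

def rule_to_regex(rule, gs, max_gap=MAX_GAP_WORDS):
    """Compile a rule such as 'mutant...causes(...)GS' into a regex."""
    pieces = []
    # Split on the gap markers and keep them; '(...)' is tried before '...'.
    for token in re.split(r'(\(\.\.\.\)|\.\.\.)', rule):
        token = token.strip()
        if not token:
            continue
        if token == '...':        # required gap: 1 to max_gap words
            pieces.append(r'\s+(?:\S+\s+){1,%d}' % max_gap)
        elif token == '(...)':    # optional gap: 0 to max_gap words
            pieces.append(r'\s+(?:\S+\s+){0,%d}' % max_gap)
        else:                     # fixed words; 'GS' stands for the known string
            words = [gs if w == 'GS' else w for w in token.split()]
            pieces.append(r'\s+'.join(re.escape(w) for w in words))
    # The lookarounds stop the phrase from starting or ending mid-word.
    return re.compile(r'(?<!\w)' + ''.join(pieces) + r'(?!\w)')

text1 = 'how the expression of a mutant gene that\ncauses ALS can perturb'
text2 = 'mothers with induced allergy to ovalbumin, and normal control neonates'

m1 = rule_to_regex('mutant...causes(...)GS', 'ALS').search(text1)
m2 = rule_to_regex('induced...to GS', 'ovalbumin').search(text2)
print(repr(m1.group(0)))
print(repr(m2.group(0)))
```

Because `\s` also matches a newline, a gap can still span a line break, as in `text1`. You could then loop over a list of (rule, GS) pairs for each file instead of hard-coding one pattern.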