
Searching for a specific phrase pattern within lines (Python)


I have made certain rules that I need to search for in a file. These rules are essentially phrases with an unknown number of words within. For example,

mutant...causes(...)GS

This is a phrase that I want to search for in my file. The ... means that a few words must appear in this gap, and (...) means that the gap may or may not contain words. GS is a fixed string that I know in advance.

Basically I made these rules by going through many such files and they tell me that a particular file does what I am looking for.

The problem is that a gap can contain any (small) number of words, and a gap may even contain a line break. Hence, I cannot use exact string matching.

Some example texts -

  1. !Series_summary "To better understand how the expression of a *mutant gene that causes ALS* can perturb the normal phenotype of astrocytes, and to identify genes that may

Here the GS is ALS (defined) and the starred text should be found as a positive match for the rule mutant...causes(...)GS

  2. !Series_overall_design "The analysis includes 9 samples of genomic DNA from isolated splenic CD11c+ dendritic cells (>95% pure) per group. The two groups are neonates born to mothers with *induced allergy to ovalbumin*, and normal control neonates. All neonates are genetically and environmentally identical, and allergen-naive."

Here the GS is ovalbumin (defined) and the starred text should be found as a positive match for the rule induced...to GS

I am a beginner at programming in Python, so any help would be great!


Solution

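  • The following should get you started. It reads in your file and displays every piece of text matched by a Python regular expression, which lets you check that the pattern picks up the right passages. GS in the pattern is a placeholder for your known string, and the re.S flag lets . match newlines, so a match can span a line break: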

    import re
    
    with open('input.txt', 'r') as f_input:
        data = f_input.read()
        print(re.findall(r'(mutant\s.*?\scauses.*?GS)', data, re.S))
    

    To check only whether at least one match is present, change findall to search:

    import re
    
    with open('input.txt', 'r') as f_input:
        data = f_input.read()
        if re.search(r'(mutant\s.*?\scauses.*?GS)', data, re.S):
            print('found')
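
A minimal sketch of substituting a real value for the GS placeholder. The variable name `gs`, its value, and the sample text are assumptions for illustration; `re.escape` is used so that any special characters in the string are matched literally.

```python
import re

# Hypothetical value for GS; in practice this comes from your own data.
gs = 'ALS'

# Same pattern as above, with the real string substituted for GS.
pattern = re.compile(r'mutant\s.*?\scauses.*?' + re.escape(gs), re.S)

# Sample text with a line break inside one of the gaps.
text = ('To better understand how the expression of a mutant gene that\n'
        'causes ALS can perturb the normal phenotype of astrocytes')

match = pattern.search(text)
if match:
    print(match.group(0))
```

Because `.*?` combined with re.S can stretch across any amount of text, this pattern does not by itself enforce the "few words" limit described in the question.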
    

    To carry this out on many such files, you could adapt it as follows:

    import re
    import glob
    
    for filename in glob.glob('*.*'):
        with open(filename, 'r') as f_input:
            data = f_input.read()
            if re.search(r'mutant\s.*?\scauses.*?GS', data, re.S):
                print("'{}' matches".format(filename))
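
The question asks for gaps of only a few words, whereas `.*?` with re.S can span any amount of text. As a rough sketch (not part of the original answer), a rule written in the question's notation could be translated into a regex with bounded gaps. The helper name `rule_to_regex` and the default limit of 5 words per gap are assumptions; the question only says "a few".

```python
import re

def rule_to_regex(rule, gs, max_words=5):
    """Compile a rule such as 'mutant...causes(...)GS' into a regex.

    '...' becomes a gap of 1 to max_words words, '(...)' a gap of 0 to
    max_words words, and the token 'GS' is replaced by the escaped gs.
    max_words=5 is an assumed limit, not something the question specifies.
    """
    # Split the rule into optional gaps, required gaps, and plain words.
    tokens = re.findall(r'\(\.\.\.\)|\.\.\.|[^\s.()]+', rule)
    parts = []
    for token in tokens:
        if token == '...':
            parts.append(r'(?:\s+\S+){1,%d}?' % max_words)
        elif token == '(...)':
            parts.append(r'(?:\s+\S+){0,%d}?' % max_words)
        else:
            word = re.escape(gs if token == 'GS' else token)
            # Every word after the first is preceded by whitespace,
            # which may include a line break.
            parts.append(word if not parts else r'\s+' + word)
    # Lookarounds stop the match from starting or ending inside a longer word.
    return re.compile(r'(?<!\w)' + ''.join(parts) + r'(?!\w)')

# Shortened versions of the two example texts from the question.
text1 = 'expression of a mutant gene that\ncauses ALS can perturb'
text2 = 'mothers with induced allergy to ovalbumin, and normal control neonates'

m1 = rule_to_regex('mutant...causes(...)GS', 'ALS').search(text1)
m2 = rule_to_regex('induced...to GS', 'ovalbumin').search(text2)
print(m1.group(0) if m1 else None)
print(m2.group(0) if m2 else None)
```

Each gap is matched lazily, so the shortest phrase that satisfies the rule is returned. A phrase whose gap exceeds `max_words` is not matched.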