pythonregexbiopythonstring-searchsequence-analysis

Python/Biopython: How to search for a reference sequence (string) in a sequence with gaps?


I am facing the following problem and have not found a solution yet:

I am working on a tool for sequence analysis which uses a file with reference sequences and tries to find one of these reference sequences in a test sequence.

The problem is that the test sequence might contain gaps (for example: ATG---TCA). I want my tool to find a specific reference sequence as substring of the test sequence even if the reference sequence is interrupted by gaps (-) in the test sequence.

For example:

one of my reference sequences: a = TGTAACGAACGG

my test sequence: b = ACCT**TGT--CGAA-GG**AGT

(the corresponding part from the reference sequence is given in bold)

I though about regular expressions and tried to work myself into it but if I am not wrong regular expressions only work the other way round. So I would need to include the gap positions as regular expressions into the reference sequence and than map it against the test sequence. However, I do not know the positions, the length and the number of gaps in the test sequence. My idea was to exchange gap positions (so all -) in the test sequence string into some kind of regular expressions or into a special character which stand for any other character in the reference sequence. Than I would compare the unmodified reference sequences against my modified test sequence... Unfortunately I have not found a function in python for string search or a type of regular expression which could to this.

Thank you very much!


Solution

  • There's good news and there's bad news...

    Bad news first: What you are trying to do it not easy and regex is really not the way to do it. In a simple case regex could be made to work (maybe) but it will be inefficient and would not scale.

    However, the good news is that this is well understood problem in bioinformatics (e.g. see https://en.wikipedia.org/wiki/Sequence_alignment). Even better news is that there are tools in Biopython that can help you. E.g. http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

    EDIT From the discussion below it seems you are saying that 'b' is likely to be very long, but assuming 'a' is still short (12 bases in your example above) I think you can tackle this by iterating over every 12-mer in 'b'. I.e. divide 'b' into sequences that are 12 bases long (obviously you'll end up with a lot!). You can then easily compare the two sequences. If you really want to use regex (and I still advise you not to) then you can replace the '-' with a '.' and do a simple match. E.g.

    import re
    
    ''' a is the reference '''
    a = 'TGTAACGAACGG'
    
    ''' b is 12-mer taken from the seqence of interest, in reality you'll be  doing this test for every possible 12-mer in the sequence'''
    b = 'TGT--CGAA-GG'
    
    b = b.replace('-', '.')
    r = re.compile(b);
    m = r.match(a)
    
    print(m)