pythonbioinformaticstext-processing

SAM alignment: Extract specific region in the query sequence and its enclosing parts in the CIGAR string


I need to perform the local alignment of a given region of a DNA sequence for which a global alignment was made, and update the corresponding part of the global CIGAR string.

The steps would be as follow:

1. Extract the region of interest from the "query" sequence according to the CIGAR string of the global alignment.
2. Perform the local alignment of the region and generate the local CIGAR string.
3. Extract from the global CIGAR string the operations "before" and "after" the region of interest.
4. Generate the new global CIGAR string

I would like to get help for #1 and #3

Here's a concrete example:

Input:
------
Reference sequence: ACGGCATCAGCGATCATCGGCATATCGAC (against which the global alignment was made)
Query sequence:     TCATCAAGCGTCGCCCTC
CIGAR string:       1S5M1I3M4D6M2D2M
CIGAR position:     5 (1-based, position of the leftmost mapped nucleotide in the reference)
Genomic region:     7-23 (1-based, inclusive)

And the interpretation of the above data:

Reference:   ACGGCA|TCA-GCGATCATCGGCAT|ATCGAC
Query:          .CA|TCAAGCG----TCGCCC-|-TC
CIGAR*:         SMM|MMMIMMMDDDDMMMMMMD|DMM
Position:        ↑

note: the genomic region is in between the | characters

So, in this example I would need to extract TCAAGCGTCGCCC from Query and SMM & DMM from CIGAR*.


Here's my clumsy attempt:

import re
import itertools

# Input:
reference = 'ACGGCATCAGCGATCATCGGCATATCGAC'
query     = 'TCATCAAGCGTCGCCCTC'
cigar     = '1S5M1I3M4D6M2D2M'
region    = (7-1, 23)
position  = 5-1

# Now trying the extract the region in the query and the outer parts of the CIGAR:
x0, x1 = region       # range in reference
y0, y1 = (None, None) # range in query
c0, c1 = ([], [])     # cigar operations, before and after the region
x      = position     # current position in reference
y      = 0            # current position in query

for op in itertools.chain.from_iterable(itertools.repeat(m[1], int(m[0])) for m in re.findall(r'(\d+)([MIDS])', cigar)):

    if x < x0: 
        c0.append(op)
    elif x < x1:
        if y0 == None:
            y0 = y 
        else:
            y1 = y 
    else:
        c1.append(op)

    if op == 'M':
        x += 1
        y += 1
    elif op in ['S','I']:
        y += 1
    elif op == 'D':
        x += 1

y1 += 1

print( (''.join(c0), query[y0:y1], ''.join(c1)) )
('SMM', 'TCAAGCGTCGCCCT', 'DMM')

The problem is that I get a superfluous T at the end of the query region.


Solution

  • Only when encountering an operation that increments y should you update y0 and y1. Here you would need to exclude D, but you can make a more generic code by storing the increments of an operation in a dict:

    increments = {
        'M': {'x': 1, 'y': 1},
        'I': {'x': 0, 'y': 1},
        'D': {'x': 1, 'y': 0},
        'S': {'x': 0, 'y': 1},
    }
    
    for op in itertools.chain.from_iterable(
        itertools.repeat(m[1], int(m[0]))
        for m in re.findall(r'(\d+)([MIDS])', cigar)
    ):
    
        if x < x0: 
            c0.append(op)
        elif x < x1:
            if increments[op]['y']:
                if y0 == None:
                    y0 = y 
                y1 = y+1 
        else:
            c1.append(op)
    
        x += increments[op]['x']
        y += increments[op]['y']
    
    print( (''.join(c0), query[y0:y1], ''.join(c1)) )
    
    ('SMM', 'TCAAGCGTCGCCC', 'DMM')