pythonwspconsensus

consensus score and WSP score in python


if I have 3 DNA sequences and I want to evaluate them by some functions:

 seq1='AG_CT'
 seq2='AG_CT'
 seq3='ACT_T'

How can I calculate the consensus score and the weighted sum of pairs score (WSP score) of this three DNA sequences in python?

consensus score is the sum of the pairwise score between sequences and the consensus sequence, Consensus (A)=sum{l}^{i=1}d(i) l is the lenght of sequence , d is the distance between two bases, example: d(A,B)=2 for A!=B, d(A,-)=d(-,A)=1 for A!='-',0 else. A and B may be 'A or C or G or T ' for the above example

     we calculate distance between seq1 and seq2 then seq1 and seq3 then seq2 and seq3

**seq1 and seq2:**
d(A,A)=0, d(G,G)=0, d(-,-)=0, d(c,c)=0, d(t,t)=0
**seq1 and seq3**:
d(A,A)=0, d(G,C)=2, d(-,T)=1, d(c,-)=1, d(t,t)=0
**seq2 and seq3**:
d(A,A)=0, d(G,C)=2, d(-,T)=1, d(c,-)=1, d(t,t)=0


         seq1= A  G  _  C  T
         seq2= A  G  _  C  T
         seq3= A  C  T  _  T
               0  0  0  0  0
               0  2  1  1  0
               0  2  1  1  0
               ++++++++++++++
               0+ 4+ 2+ 2+ 0= 8

consensus(A)=8

weighted sum of pairs WSP (A) = \sum_{i=1}^{k-1} \sum_{j=i+l}^k \sum_{h=1}^l wij* s( A[ i,h ], [ j,h ] l :lenght of sequence, k number of sequences , wij weight of sequences i and j

s(A,B)=2 for A!=B, s(A,-)=d(-,A)=-1 for A!='-',3 else.all the weight factors are 1.

             seq1= A  G  _  C  T
             seq2= A  G  _  C  T
             seq3= A  C  T  _  T
                   3  3  3  3  3
                   3  2 -1 -1  3
                   3  2 -1 -1  3
                   ++++++++++++++
                  (3+3+3)*1+(3+2+2)*1+(3-1-1)*1+(3-1-1)*1+(3+3+3)*1=9*1+7*1+1*1+1*1+9*1
                   9+7+1+1+9=27

So, WSP score of the three sequences is 27


Solution

  • I would approach this as follows. First, create functions to calculate the individual distances and weighted sums:

    def distance(a, b):
        """Distance between two bases a and b."""
        if a == b:
            return 0
        elif a == "_" or b == "_":
            return 1
        else:
            return 2
    

    and

    def w_sum(a, b, w=1):
        """Calculate the pair sum of bases a and b with weighting w."""
        if a == b:
            return 3 * w
        elif a == "_" or b == "_":
            return -1 * w
        else:
            return 2 * w
    

    Second, create the sets of bases at the same positions using the zip function:

    list(zip(seq1, seq2, seq3)) == [('A', 'A', 'A'), 
                                    ('G', 'G', 'C'), 
                                    ('_', '_', 'T'), 
                                    ('C', 'C', '_'), 
                                    ('T', 'T', 'T')]
    

    Third, generate the pairs within each position using itertools.combinations:

    list(combinations(('G', 'G', 'C'), 2)) == [('G', 'G'), 
                                               ('G', 'C'), 
                                               ('G', 'C')]
    

    Finally, add up the distances and sums:

    from itertools import combinations
    
    consensus = 0
    wsp = 0
    for position in zip(seq1, seq2, seq3): # sets at same position
        for pair in combinations(position, 2): # pairs within set
            consensus+= distance(*pair) # calculate distance
            wsp += w_sum(*pair) # calculate pair sum
    

    Note the use of *pair to unpack the 2-tuples of base pairs into the two arguments of the calculation functions.