python bam pysam

sum elements in python list if match condition


I have a variable holding lists with a varying number of elements:

['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M']
['124', 'M', '19', 'M', '7', 'M']
['19', 'M', '131', 'M']
['3', 'M', '19', 'M', '128', 'M']
['12', 'M', '138', 'M']

Each list always alternates number, letter, and the order matters.

I would like to add up the values of consecutive Ms only (i.e. if there is a D, the sum stops there), to get:

['30', 'M', '1', 'D', '120', 'M']
['150', 'M']
['150', 'M']
['150', 'M']
['150', 'M']

P.S. The complete story is that I want to convert soft clips to matches in a BAM file, but I got stuck at this step.

#!/usr/bin/python

import re
import sys
import pysam

bamFile = sys.argv[1]

bam = pysam.AlignmentFile(bamFile, 'rb')

for read in bam:
    cigar = read.cigarstring
    # split the CIGAR string into alternating numbers and letters
    sepa = re.findall(r'(\d+|[A-Za-z]+)', cigar)
    
    for i in range(len(sepa)):
        if sepa[i] == 'S':
            sepa[i] = 'M'
            

Solution

  • You can slice Python lists using a step (sometimes called a stride). Use this to get every second element, starting at index 1 (for the first letter):

    >>> example = ['30', 'M', '1', 'D', '120', 'M']
    >>> example[1::2]
    ['M', 'D', 'M']
    

    The [1::2] syntax means: start at index 1, go on until you run out of elements (nothing entered between the : delimiters), and step over the list to return every second value.

    You can do the same thing for the numbers, using [::2], which begins with the value right at the start and takes every other value.
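    As a small illustration of the two slices side by side (the variable names numbers and letters are only chosen for this sketch):

```python
example = ['30', 'M', '1', 'D', '120', 'M']

numbers = example[::2]   # every second element, starting at index 0
letters = example[1::2]  # every second element, starting at index 1

print(numbers)  # ['30', '1', '120']
print(letters)  # ['M', 'D', 'M']
```

    Both slices have the same length as long as the input really alternates number, letter.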

    If you then combine this with the zip() function you can pair up your numbers and letters to figure out what to sum:

    def sum_m_values(values):
        summed = []
        m_sum = 0
        for number, letter in zip(values[::2], values[1::2]):
            if letter != "M":
                if m_sum:
                    summed += (str(m_sum), "M")
                    m_sum = 0
                summed += (number, letter)
            else:
                m_sum += int(number)
        if m_sum:
            summed += (str(m_sum), "M")
        return summed
    

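    The above function takes your list of numbers and letters, pairs them up, and keeps a running total (m_sum) for consecutive M pairs. When it meets any other letter, it first appends the accumulated total (if there is one) followed by "M", resets the total, and then appends the non-M pair unchanged. After the loop, any remaining total is appended as a final M pair.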

    This covers all your example inputs:

    >>> def sum_m_values(values):
    ...     summed = []
    ...     m_sum = 0
    ...     for number, letter in zip(values[::2], values[1::2]):
    ...         if letter != "M":
    ...             if m_sum:
    ...                 summed += (str(m_sum), "M")
    ...                 m_sum = 0
    ...             summed += (number, letter)
    ...         else:
    ...             m_sum += int(number)
    ...     if m_sum:
    ...         summed += (str(m_sum), "M")
    ...     return summed
    ...
    >>> examples = [
    ...     ['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M'],
    ...     ['124', 'M', '19', 'M', '7', 'M'],
    ...     ['19', 'M', '131', 'M'],
    ...     ['3', 'M', '19', 'M', '128', 'M'],
    ...     ['12', 'M', '138', 'M'],
    ... ]
    >>> for example in examples:
    ...     print(example, "->", sum_m_values(example))
    ...
    ['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M'] -> ['30', 'M', '1', 'D', '120', 'M']
    ['124', 'M', '19', 'M', '7', 'M'] -> ['150', 'M']
    ['19', 'M', '131', 'M'] -> ['150', 'M']
    ['3', 'M', '19', 'M', '128', 'M'] -> ['150', 'M']
    ['12', 'M', '138', 'M'] -> ['150', 'M']
    

    There are other methods of looping over a list in fixed-sized groups; you can also create an iterator for the list with iter() and then use zip() to pull consecutive elements into pairs:

    it = iter(inputlist)
    for number, letter in zip(it, it):
        # ...
    

    This works because zip() gets the next element for each value in the pair from the same iterator, so "124" first, then "M", etc.:

    >>> example = ['124', 'M', '19', 'M', '7', 'M']
    >>> it = iter(example)
    >>> for number, letter in zip(it, it):
    ...     print(number, letter)
    ...
    124 M
    19 M
    7 M
    

    However, for short lists it is perfectly fine to use slicing, as it can be understood more easily.

    Next, you can make the summing a little easier by using the itertools.groupby() function to give you your number + letter pairs as separate groups. That function takes an input sequence, and a function to produce the group identifier. When you then loop over its output you are given that group identifier and an iterator to access the group members (those elements that have the same group value).

    Just pass it the zip() iterator built before, and either lambda pair: pair[1] or operator.itemgetter(1). The latter is a little faster but does the same thing as the lambda: it gets the letter from the number + letter pair.
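    To make the grouping visible, here is a minimal sketch that turns each group into a list so it can be printed; the groups variable exists only for this illustration:

```python
from itertools import groupby
from operator import itemgetter

example = ['20', 'M', '10', 'M', '1', 'D', '14', 'M', '106', 'M']
it = iter(example)
paired = zip(it, it)  # ('20', 'M'), ('10', 'M'), ('1', 'D'), ...

# Materialise each group into a list so it can be inspected
groups = [(letter, list(group)) for letter, group in groupby(paired, itemgetter(1))]
for letter, members in groups:
    print(letter, members)
# M [('20', 'M'), ('10', 'M')]
# D [('1', 'D')]
# M [('14', 'M'), ('106', 'M')]
```

    Note that groupby() only groups consecutive pairs with the same letter, which is why the two runs of M stay separate.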

    With separate groups, the logic starts to look a lot simpler:

    from itertools import groupby
    from operator import itemgetter
    
    def sum_m_values(values):
        summed = []
        it = iter(values)
        paired = zip(it, it)
    
        for letter, grouped in groupby(paired, itemgetter(1)):
            if letter == "M":
                total = sum(int(number) for number, _ in grouped)
                summed += (str(total), letter)
            else:
                # add the (number, "D") as separate elements
                for number, letter in grouped:
                    summed += (number, letter)
                
        return summed
    

    The output of the function hasn't changed, only the implementation.

    Finally, we could turn the function into a generator function, by replacing the summed += ... statements with yield from ..., so it'll still generate a sequence of numeric strings and letters:

    from itertools import groupby
    from operator import itemgetter
    
    def sum_m_values(values):
        it = iter(values)
        paired = zip(it, it)
    
        for letter, grouped in groupby(paired, itemgetter(1)):
            if letter == "M":
                total = sum(int(number) for number, _ in grouped)
                yield from (str(total), letter)
            else:
                # add the (number, "D") as separate elements
                for number, letter in grouped:
                    yield from (number, letter)
    
    

    You can then use list(sum_m_values(...)) to get a list again, or just use the generator as-is. For long inputs, that could be the preferred option as that means you never need to keep everything in memory all at once.
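    As one way of using the generator as-is, the sketch below feeds it straight into str.join() to rebuild a CIGAR string, after first relabelling S as M as in the question's loop. The soft_clips_to_match() helper is hypothetical and works on a plain string only; it does not touch pysam, and it does not adjust the read's start position, which converting a leading soft clip would probably require.

```python
import re
from itertools import groupby
from operator import itemgetter

def sum_m_values(values):
    # Same generator as above: merge runs of M, pass other pairs through
    it = iter(values)
    paired = zip(it, it)
    for letter, grouped in groupby(paired, itemgetter(1)):
        if letter == "M":
            total = sum(int(number) for number, _ in grouped)
            yield from (str(total), letter)
        else:
            for number, other in grouped:
                yield from (number, other)

def soft_clips_to_match(cigar):
    # Hypothetical helper, not part of pysam: works on a plain CIGAR string
    sepa = re.findall(r'\d+|[A-Za-z=]', cigar)
    sepa = ['M' if item == 'S' else item for item in sepa]
    # The generator is consumed directly by join(); no intermediate list
    return "".join(sum_m_values(sepa))

print(soft_clips_to_match("20S10M1D14M106S"))  # 30M1D120M
print(soft_clips_to_match("7S143M"))           # 150M
```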

    If you can guarantee that only numbers with M repeat (so a D pair is always followed by an M pair or is the last pair in the sequence), you can even drop the if test and always sum:

    from itertools import groupby
    from operator import itemgetter
    
    def sum_m_values(values):
        it = iter(values)
        paired = zip(it, it)
    
        for letter, grouped in groupby(paired, itemgetter(1)):
            yield str(sum(int(number) for number, _ in grouped))
            yield letter
    

    This works because there will only ever be one number value per D group, so summing won’t turn it into a different number.
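    The guarantee matters. The sketch below repeats the simplified version under a different name (sum_all_groups, chosen only for this illustration) and shows what happens when two D pairs do follow each other:

```python
from itertools import groupby
from operator import itemgetter

def sum_all_groups(values):
    # Same logic as the simplified version above, under a different name
    it = iter(values)
    paired = zip(it, it)
    for letter, grouped in groupby(paired, itemgetter(1)):
        yield str(sum(int(number) for number, _ in grouped))
        yield letter

# A lone D between M runs is left unchanged
print(list(sum_all_groups(['20', 'M', '10', 'M', '1', 'D', '14', 'M'])))
# ['30', 'M', '1', 'D', '14', 'M']

# Two consecutive D pairs are merged as well
print(list(sum_all_groups(['5', 'M', '1', 'D', '2', 'D', '7', 'M'])))
# ['5', 'M', '3', 'D', '7', 'M']
```

    If merging repeated non-M pairs is not what you want, keep the version with the if test.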