python sequencing

Split text into phrases and enumerate them


I have this sequence:

>my_sequence
atccagcaaaaacgctccaaggattctcgactggactcattacttaatcagtattcgcaagcggacgccgaggtcgtaaaggctgaaaccgcacaatcggatgcgcccagtgatgacgcactxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxcgccttgcccacccaccgacaaccggtgagtgaaaaattggaacggtgattaaaxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxttgtgctttatttctggagggcggtgtttaggggtaggcgcgccatgttttttgccttcagcgatcccaggtacaaccagtccccatattcgcgcactgtcgtgatcggcgagtaattacctgtgctcgcatcttgcaggttggcaatcaccttgccgtccaagtccagacccagtgcaaaggcacgcttttccatgggtttgggcagtaccgtcaatgcccgaacaatcattttgc

I want to split this long sequence, removing the runs of "x", and create separate numbered sequences like this:

>1
atccagcaaaaacgctccaaggattctcgactggactcattacttaatcagtattcgcaagcggacgccgaggtcgtaaaggctgaaaccgcacaatcggatgcgcccagtgatgacgcact
>2
cgccttgcccacccaccgacaaccggtgagtgaaaaattggaacggtgattaaa
>3
ttgtgctttatttctggagggcggtgtttaggggtaggcgcgccatgttttttgccttcagcgatcccaggtacaaccagtccccatattcgcgcactgtcgtgatcggcgagtaattacctgtgctcgcatcttgcaggttggcaatcaccttgccgtccaagtccagacccagtgcaaaggcacgcttttccatgggtttgggcagtaccgtcaatgcccgaacaatcattttgc

Does anyone have an idea of how to start?

Thank you.


Solution

  • A simple way would be to first split on each "x" character, and then filter out the empty results:

    sequences = filter(None, my_sequence.split("x"))
    

    Here, the None argument to filter means to keep only truthy values: empty strings are falsy, so they are removed from the results.

    Note: In Python 3, filter returns an iterator, so if you want a list, use:

    sequences = list(filter(None, my_sequence.split("x")))
    

    For example:

    In [5]: list(filter(None, my_sequence.split("x")))
    Out[5]: 
    ['atccagcaaaaacgctccaaggattctcgactggactcattacttaatcagtattcgcaagcggacgccgaggtcgtaaaggctgaaaccgcacaatcggatgcgcccagtgatgacgcact',
     'cgccttgcccacccaccgacaaccggtgagtgaaaaattggaacggtgattaaa',
     'ttgtgctttatttctggagggcggtgtttaggggtaggcgcgccatgttttttgccttcagcgatcccaggtacaaccagtccccatattcgcgcactgtcgtgatcggcgagtaattacctgtgctcgcatcttgcaggttggcaatcaccttgccgtccaagtccagacccagtgcaaaggcacgcttttccatgggtttgggcagtaccgtcaatgcccgaacaatcattttgc']
    

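    Another solution is to use regular expressions. Splitting on the x+ pattern, which matches one or more x's in a row, treats each run of "x" as a single separator regardless of its length, so no empty strings appear between sequences. If the input begins or ends with "x", the result still contains an empty string at that end, which you can filter out as above.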

    For example:

    In [6]: import re
    In [7]: p = re.compile(r'x+')
    In [8]: p.split(my_sequence)
    Out[8]: 
    ['atccagcaaaaacgctccaaggattctcgactggactcattacttaatcagtattcgcaagcggacgccgaggtcgtaaaggctgaaaccgcacaatcggatgcgcccagtgatgacgcact',
     'cgccttgcccacccaccgacaaccggtgagtgaaaaattggaacggtgattaaa',
     'ttgtgctttatttctggagggcggtgtttaggggtaggcgcgccatgttttttgccttcagcgatcccaggtacaaccagtccccatattcgcgcactgtcgtgatcggcgagtaattacctgtgctcgcatcttgcaggttggcaatcaccttgccgtccaagtccagacccagtgcaaaggcacgcttttccatgggtttgggcagtaccgtcaatgcccgaacaatcattttgc']
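
The answers above produce the list of fragments, but the question also asks for numbered headers (>1, >2, ...). A minimal sketch of that last step follows, assuming the whole sequence fits in memory as one string. The helper names split_masked and to_fasta are hypothetical, and the short my_sequence is a stand-in for the real one.

```python
import re


def split_masked(sequence, mask="x"):
    """Return the non-empty fragments of `sequence` between runs of `mask`."""
    # re.escape keeps this safe if the mask character is special in a regex.
    pattern = re.escape(mask) + "+"
    # Drop the empty strings that appear when the input starts or ends with the mask.
    return [fragment for fragment in re.split(pattern, sequence) if fragment]


def to_fasta(fragments, start=1):
    """Format fragments as FASTA-style records with numeric headers."""
    lines = []
    for number, fragment in enumerate(fragments, start=start):
        lines.append(f">{number}")
        lines.append(fragment)
    return "\n".join(lines)


# Shortened stand-in for the sequence in the question (an assumption for illustration).
my_sequence = "atccagxxxxcgccttxxxxxxttgtgc"
print(to_fasta(split_masked(my_sequence)))
```

enumerate with start=1 supplies the running number, so the headers stay consistent however many fragments the split produces.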