python string bioinformatics encoder rle

Create a regex-like summary (run-length encoding) of a string s for blocks of a given length k


I am looking for Python code that run-length encodes a string s into a regex-like summary, where the block length k is known in advance. How should I tackle this?

e.g.

s=TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA

with k=5 could become

(TATTT)3(TATGT)9TACATTATTTTA

Solution

  • Instead of a regex pattern, you could split the string into chunks of length k and group consecutive equal chunks with itertools.groupby:

    from itertools import groupby
    
    s = "TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA"
    
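    k = 5
    # Split s into consecutive chunks of length k; the last chunk may be shorter.
    parts = [s[i:i+k] for i in range(0, len(s), k)]
    
    # groupby merges runs of adjacent equal chunks. The loop variables are not
    # named k, so the block length is not overwritten.
    for block, run in groupby(parts):
        print(block, len(list(run)))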

    For your specific string, this prints:

    TATTT 3
    TATGT 9
    TACAT 1
    TATTT 1
    TA 1
    

    Or, if you need to stick with your specific format:

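    lst = []
    for block, run in groupby(parts):
        count = len(list(run))
        if count > 1:
            # Repeated block: wrap it in parentheses and append the repeat count
            lst.append(f"({block}){count}")
        else:
            # Single occurrence: emit the block unchanged
            lst.append(block)
    
    print("".join(lst))
    # (TATTT)3(TATGT)9TACATTATTTTA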
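
If you want to reuse this, the two steps above can be wrapped in a function. The following is only a sketch: the names `rle_blocks` and `expand_blocks` are made up for illustration, and the decoder assumes the input contains no parentheses or digits (true for plain DNA sequences), since otherwise the encoded form would be ambiguous.

```python
import re
from itertools import groupby


def rle_blocks(s, k):
    """Run-length encode s over consecutive, non-overlapping blocks of length k."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Fixed-width chunks starting at offset 0; the final chunk may be shorter than k.
    parts = [s[i:i + k] for i in range(0, len(s), k)]
    out = []
    for block, run in groupby(parts):
        count = sum(1 for _ in run)
        # Only repeated blocks get the "(block)count" form.
        out.append(f"({block}){count}" if count > 1 else block)
    return "".join(out)


def expand_blocks(encoded):
    """Undo rle_blocks, assuming the original string had no parentheses or digits."""
    return re.sub(
        r"\(([^()]+)\)(\d+)",
        lambda m: m.group(1) * int(m.group(2)),
        encoded,
    )
```

Calling `rle_blocks(s, 5)` on the string from the question should give the same summary as the loop above, and `expand_blocks` turns that summary back into the original string. Note that the chunks are always cut starting at position 0, so a repeat that is not aligned to a multiple of k will not be collapsed.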