I am looking for Python code that run-length encodes a string s into a regex-like summary, given a known block length k. How should I tackle this?
e.g.
s=TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA
with k=5 could become
(TATTT)3(TATGT)9TACATTATTTTA
Instead of a regex pattern, you could split the string into blocks of length k and group them with itertools.groupby:
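from itertools import groupby
s = "TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA"
k = 5
# Split s into consecutive blocks of length k (the last block may be shorter).
parts = [s[i:i+k] for i in range(0, len(s), k)]
# groupby merges runs of identical adjacent blocks; the loop variable is
# named block so that it does not overwrite the block length k.
for block, group in groupby(parts):
    print(block, len(list(group)))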
For your specific string this would yield:
TATTT 3
TATGT 9
TACAT 1
TATTT 1
TA 1
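If you want the counts as data rather than printed output, the same grouping can be wrapped in a small helper. This is only a minimal sketch: the name block_runs is made up for illustration, and it assumes blocks are aligned to the start of the string, exactly as in the loop above.

```python
from itertools import groupby

def block_runs(s, k):
    """Return (block, count) pairs for runs of identical adjacent k-length blocks."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Consecutive, non-overlapping blocks; the last one may be shorter than k.
    parts = [s[i:i + k] for i in range(0, len(s), k)]
    # sum(1 for _ in run) counts the run without building a list.
    return [(block, sum(1 for _ in run)) for block, run in groupby(parts)]

s = "TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA"
for block, count in block_runs(s, 5):
    print(block, count)
```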
Or, if you need to stick to your specific format:
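lst = []
for block, group in groupby(parts):
    count = len(list(group))
    if count > 1:
        # Repeated block: wrap it in parentheses and append the count.
        lst.append(f"({block}){count}")
    else:
        # Single occurrence: emit the block unchanged.
        lst.append(block)
print("".join(lst))
# (TATTT)3(TATGT)9TACATTATTTTA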
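To check that the summary loses no information, you could pair the encoder with a decoder and round-trip the string. The sketch below is an illustration only: the function names encode and decode are hypothetical, and the decoder assumes the original string contains no parentheses or digits (true for DNA-like input such as yours), since otherwise the format would be ambiguous.

```python
import re
from itertools import groupby

def encode(s, k):
    """Summarise s as '(BLOCK)N' for repeated k-length blocks, plain text otherwise."""
    parts = [s[i:i + k] for i in range(0, len(s), k)]
    out = []
    for block, run in groupby(parts):
        count = sum(1 for _ in run)
        out.append(f"({block}){count}" if count > 1 else block)
    return "".join(out)

def decode(encoded):
    """Expand every '(BLOCK)N' group back into N copies of BLOCK.

    Assumes the original string had no '(' , ')' or digit characters.
    """
    return re.sub(
        r"\(([^()]+)\)(\d+)",
        lambda m: m.group(1) * int(m.group(2)),
        encoded,
    )

s = "TATTTTATTTTATTTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTATGTTACATTATTTTA"
summary = encode(s, 5)
print(summary)
print(decode(summary) == s)
```

Note that this approach only detects repeats that line up with block boundaries counted from the first character; a repeat that starts at a different offset would not be merged.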