Tags: python, regex, string, replace, run-length-encoding

Replace a series of the same character by its number of occurrences in the series


I have a string like this:

AABBBB$CCCDEEE$AABADEE

And I want a result like this:

2A4B$3CD3E$2ABAD2E

To do that, I wrote a for loop over the string. It works:

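import re

string = "AABBBB$CCCDEEE$AABADEE"
out_string = string
k = 1       # length of the current run
c_old = ""  # previous character
# The trailing "" is a sentinel so that the final run is also processed.
for c in list(string) + [""]:
    if c_old == c:
        k += 1
    else:
        if k > 1:
            s = c_old * k         # the run as it appears in the string
            chg = str(k) + c_old  # its replacement, e.g. "4B"
            # Escape the run in case it contains regex metacharacters such as "$".
            out_string = re.sub(re.escape(s), chg, out_string, count=1)
        k = 1
    c_old = c

print(out_string)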

But on very long strings it is slow.

Is there a way to do this without looping over the whole string character by character, perhaps using the re module?


Solution

  • Not sure why you think re.sub() is appropriate for this. You just need a fairly trivial iteration over the source string.

    Something like this:

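    s = "AABBBB$CCCDEEE$AABADEE"

    r = ""    # encoded result
    c = 1     # length of the current run
    p = s[0]  # previous character (assumes s is non-empty)

    for x in s[1:]:
        if x == p:
            c += 1
        else:
            # The run of p has ended: emit it, with a count if longer than 1.
            r += p if c == 1 else f"{c}{p}"
            c = 1
            p = x
    # Emit the final run, which the loop itself never flushes.
    r += p if c == 1 else f"{c}{p}"

    print(r)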
    

    Output:

    2A4B$3CD3E$2ABAD2E
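
Since the question asks about the re module specifically, here is a minimal sketch of two alternatives, offered as one possible approach rather than the only one. The function names rle_regex and rle_groupby are made up for illustration. The first uses a single re.sub() call with a backreference, the second uses itertools.groupby. Both still read every character once, which is unavoidable, but they avoid rescanning the output string for each run. Note that the encoding is ambiguous if the input itself contains digits.

```python
import re
from itertools import groupby


def rle_regex(text: str) -> str:
    """Collapse each run of 2+ identical characters into '<count><char>'."""
    # (.) captures one character; \1+ matches one or more repeats of it.
    # re.DOTALL lets '.' match newlines too, so runs of '\n' are handled.
    return re.sub(
        r"(.)\1+",
        lambda m: f"{len(m.group(0))}{m.group(1)}",
        text,
        flags=re.DOTALL,
    )


def rle_groupby(text: str) -> str:
    """Same result without regular expressions, using itertools.groupby."""
    parts = []
    for char, run in groupby(text):
        n = sum(1 for _ in run)  # length of this run
        parts.append(char if n == 1 else f"{n}{char}")
    return "".join(parts)


sample = "AABBBB$CCCDEEE$AABADEE"
print(rle_regex(sample))    # 2A4B$3CD3E$2ABAD2E
print(rle_groupby(sample))  # 2A4B$3CD3E$2ABAD2E
```

Both functions return an empty string for empty input, and collecting the pieces in a list before joining avoids building the result through repeated string concatenation.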