Tags: regex, string, python-3.x, replace

Alternative to Regex for large string format/replace


I have a very large string of key-value pairs (old_string) that is formatted like this:

"visitorid"="gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks", "customer_name"="larry", "customer_state"="alabama",..."visitorid"="..."

The string is very large because it can hold up to 30k customers. I use it to write a file for upload to an online segmentation tool, which requires this format with one modification: the primary key (visitorid) must be tab separated and not in quotes. The end result needs to look like this (note that the 4 spaces stand for a tab):

gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks    "customer_name"="larry", "customer_state"="alabama",...ABC3k9sk-gj49-92ks-dgjs-j2ks-j29slgj9bbbb

I wrote the following function, which does this fine, but I've noticed that this portion of the script runs the slowest (I assume because regex is generally slow).

import re

def getGUIDS(old_string):
    '''
    Finds guids in the string and formats it as PK for syncfile
    @param old_string the string created from the nested dict
    @return old_string_fmt the formatted version
    '''

    print ('getting ids')
    ids = re.findall(r'("\w{8}-\w{4}-\w{4}-\w{4}-\w{12}",)', old_string) #looks for GUID based on regex

    old_string_fmt = old_string  # start from the input so every replacement accumulates
    for element in ids:
        new_str = element.strip('",') + '\t'  # drop the quotes and trailing comma, append a tab
        old_string_fmt = old_string_fmt.replace(element, new_str)

    return old_string_fmt

Is there a way this can be done without regex that might speed this up?


Solution

  • The approach is the problem, not regex itself: you first collect every match with re.findall and then call str.replace once per match, so the whole large string is scanned again for each GUID. You can simply use re.sub, which finds all non-overlapping matches and replaces them in a single pass.

    See this Python demo:

    import re
    
    def getGUIDS(old_string):
        '''
        Finds guids in the string and formats it as PK for syncfile
        @param old_string the string created from the nested dict
        @return old_string_fmt the formatted version
        '''
        print ('getting ids')
        return re.sub(r'"\w+"="(\w{8}(?:-\w{4}){4}-\w{12})"(?:,|$)', '\\1\t', old_string) #looks for GUID based on regex
    
    s='"visitorid"="gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks", "customer_name"="larry", "customer_state"="alabama",..."visitorid"="..."'
    print(getGUIDS(s))
    # => getting ids
    # => gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks   "customer_name"="larry", "customer_state"="alabama",..."visitorid"="..."
    

    I added "\w+"= at the start of the regex so that the key of the GUID value is matched and removed as well, replaced the , at the end with (?:,|$) to match either a , or the end of the string (this handles the case where the key-value pair is the last one in the string), and enclosed the part you need to keep in capturing parentheses.

    The replacement is a backreference to capturing group 1 followed by a tab character.
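
A possible variant of the demo above, offered as a sketch rather than as part of the original answer: the pattern is compiled once with re.compile, and the trailing part is widened to `,\s*` so that the space after the comma is consumed too and the tab is followed directly by the next key. That last change is an assumption about the desired output format, and the names GUID_PAIR_RE and format_guids are hypothetical.

```python
import re

# Hypothetical module-level pattern, compiled once and reused on every call.
# Same shape as the pattern in the demo, except that ",\s*" also consumes
# the whitespace after the comma (an assumption about the wanted output).
GUID_PAIR_RE = re.compile(r'"\w+"="(\w{8}(?:-\w{4}){4}-\w{12})"(?:,\s*|$)')

def format_guids(old_string):
    """Replace each "key"="GUID" pair with the bare GUID followed by a tab."""
    # Group 1 holds the GUID; the tab is appended as a literal character.
    return GUID_PAIR_RE.sub(r'\1' + '\t', old_string)

s = ('"visitorid"="gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks", '
     '"customer_name"="larry", "customer_state"="alabama"')
print(repr(format_guids(s)))
# => 'gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks\t"customer_name"="larry", "customer_state"="alabama"'
```

Whether precompiling gives a measurable gain depends on how often the function is called, since the re module already caches recently used patterns; the main saving still comes from replacing everything in one pass.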