I have a very large string of key-value pairs (old_string) that is formatted like this:
"visitorid"="gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks", "customer_name"="larry", "customer_state"="alabama",..."visitorid"="..."
This string is very large, since it can cover up to 30k customers. I am using it to write a file to upload to an online segmentation tool, which requires this format with one modification -- the primary key (visitorid) needs to be tab-separated and not in quotes. The end result needs to look like this (note that the 4 spaces are a tab):
gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks "customer_name"="larry", "customer_state"="alabama",...ABC3k9sk-gj49-92ks-dgjs-j2ks-j29slgj9bbbb
I wrote the following function, which does this fine, but I've noticed that this portion of the script runs the slowest (I am assuming because regex is generally slow).
def getGUIDS(old_string):
    '''
    Finds guids in the string and formats it as PK for syncfile
    @param old_string the string created from the nested dict
    @return old_string_fmt the formatted version
    '''
    print ('getting ids')
    ids = re.findall('("\w{8}-\w{4}-\w{4}-\w{4}-\w{12}",)', cat_string) #looks for GUID based on regex
    for element in ids:
        new_str = str(element.strip('"').strip('"').strip(",").strip('"') + ('\t'))
        old_string_fmt = old_string.replace(element, new_str)
    return old_string_fmt
Is there a way this can be done without regex that might speed this up?
The approach is wrong: you first collect all occurrences matching your regex and then run a separate replace over the whole string for each modified match. You may simply use re.sub to find all non-overlapping matches and replace them with what you need in a single pass.
See this Python demo:
import re

def getGUIDS(old_string):
    '''
    Finds guids in the string and formats it as PK for syncfile
    @param old_string the string created from the nested dict
    @return old_string_fmt the formatted version
    '''
    print ('getting ids')
    return re.sub(r'"\w+"="(\w{8}(?:-\w{4}){4}-\w{12})"(?:,|$)', '\\1\t', old_string) #looks for GUID based on regex

s='"visitorid"="gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks", "customer_name"="larry", "customer_state"="alabama",..."visitorid"="..."'
print(getGUIDS(s))
# => getting ids
# => gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks "customer_name"="larry", "customer_state"="alabama",..."visitorid"="..."
I added "\w+"= at the start of the regex to also match the key of the GUID value so that it is removed, replaced the , at the end with (?:,|$) to match either a comma or the end of the string (to also handle cases where the key-value pair is the last one in the string), and enclosed the part you need to keep in capturing parentheses. The replacement is a backreference to capturing group #1 followed by a tab character.
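Since the question also asks about doing this without regex, here is a rough sketch that puts the two options side by side. It is only an illustration: the names GUID_PAIR, format_with_regex and format_without_regex are made up for this example, the second record in the sample is invented, and the plain-string variant assumes the key is always literally "visitorid". Whether either one is actually faster on a 30k-customer string would need to be measured (for example with timeit) on real data.

```python
import re

# Same pattern as in the answer, compiled once so repeated calls reuse it.
GUID_PAIR = re.compile(r'"\w+"="(\w{8}(?:-\w{4}){4}-\w{12})"(?:,|$)')


def format_with_regex(old_string):
    """Replace each quoted GUID pair with the bare GUID followed by a tab."""
    return GUID_PAIR.sub('\\1\t', old_string)


def format_without_regex(old_string, key='"visitorid"="'):
    """Plain str-method variant (assumption: the GUID always sits under `key`)."""
    head, *records = old_string.split(key)
    out = [head]
    for record in records:
        # Everything up to the closing quote is the GUID itself.
        guid, _quote, rest = record.partition('"')
        # Drop the comma that follows the closing quote, if there is one.
        if rest.startswith(','):
            rest = rest[1:]
        out.append(guid + '\t' + rest)
    return ''.join(out)


# Hypothetical two-record sample in the format described in the question.
sample = (
    '"visitorid"="gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks", '
    '"customer_name"="larry", "customer_state"="alabama", '
    '"visitorid"="abc3k9sk-gj49-92ks-dgjs-j2ks-j29slgj9bbbb", '
    '"customer_name"="moe"'
)

# Both variants should produce the same text for this sample.
assert format_with_regex(sample) == format_without_regex(sample)
```

Note that, like the demo above, neither variant removes the space that follows the comma, so each tab is followed by a space when the pairs are separated by ", ". If the upload tool is strict about that, the regex could end in (?:,\s*|$) instead, which is an untested assumption about what the tool expects.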