I'm looking for the fastest way to replace a large number of substrings inside a very large string. Here are two approaches I've used.
findall() feels simpler and more elegant, but it takes an astounding amount of time.
finditer() blazes through a large file, but I'm not sure this is the right way to do it.
Here's some sample code. Note that the actual text I'm working with is a single string around 10 MB in size, and there's a huge performance difference between these two methods.
import re

def findall_replace(text, reg, rep):
    for match in reg.findall(text):
        output = text.replace(match, rep)
    return output
def finditer_replace(text, reg, rep):
    cursor_pos = 0
    output = ''
    for match in reg.finditer(text):
        output += "".join([text[cursor_pos:match.start(1)], rep])
        cursor_pos = match.end(1)
    output += "".join([text[cursor_pos:]])
    return output
reg = re.compile(r'(dog)')
rep = 'cat'
text = 'dog cat dog cat dog cat'
finditer_replace(text, reg, rep)
findall_replace(text, reg, rep)
UPDATE: Added a re.sub() method to the tests:
def sub_replace(reg, rep, text):
    output = re.sub(reg, rep, text)
    return output
Results
re.sub() - 0:00:00.031000
finditer() - 0:00:00.109000
findall() - 0:01:17.260000
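(For reference, the timings above came from a harness along these lines; the exact setup isn't shown in the question, so the string size and structure below are assumptions.)

import re
from datetime import datetime

reg = re.compile(r'(dog)')
rep = 'cat'
# Hypothetical stand-in for the real ~10 MB string; adjust the multiplier to taste.
# Note that findall_replace is the slow one and may take a long time at larger sizes.
text = 'dog cat ' * 100000

for func, args in [(sub_replace, (reg, rep, text)),
                   (finditer_replace, (text, reg, rep)),
                   (findall_replace, (text, reg, rep))]:
    start = datetime.now()
    func(*args)
    print(func.__name__, datetime.now() - start)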
The standard method is to use the built-in re.sub():
re.sub(reg, rep, text)
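For example, with the compiled pattern and sample string from the question:

import re

reg = re.compile(r'(dog)')
print(re.sub(reg, 'cat', 'dog cat dog cat dog cat'))
# -> 'cat cat cat cat cat cat'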
Incidentally, the reason for the performance difference between your versions is that each replacement in your first version recopies the entire string. Copies are fast, but when you're copying 10 MB at a go, enough copies will become slow.
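If you do need an explicit loop (for example, to compute each replacement yourself), a common pattern that avoids repeated full copies is to collect the pieces in a list and join once at the end. This is only a sketch with names of my own choosing, not something from the question:

def join_replace(text, reg, rep):
    # Gather untouched spans and replacements, then copy everything exactly once with join.
    parts = []
    cursor = 0
    for match in reg.finditer(text):
        parts.append(text[cursor:match.start()])
        parts.append(rep)
        cursor = match.end()
    parts.append(text[cursor:])
    return ''.join(parts)

This produces the same output as the finditer version above, but each character is copied a bounded number of times rather than potentially once per match.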