So I used dis library (disassembler for Python bytecode) to understand what causes such an interesting performance difference which is observed at different scales, even with 1000s of strings unlike the one string example below. However, I am still clueless.
I used both standard time library and timeit for the calculations.
import timeit
def find_overlapped_instances(regexx, string):
found_any = False
for start in range(len(string)):
if(found_any):#addition
break
for end in range(start, len(string)):
cur = string[start:end+1]
if re.match(regex, cur):
found_any = True
break
return found_any
#regex = re.compile(r'F.{4}D.AX')
regex = r'^{}$'.format(r'F.{4}D.AX')
fasta_file = "voodoo.fasta"
total_time = 0
num_hits = 0
for record in SeqIO.parse(fasta_file, "fasta"):
sequence = str(record.seq)
starttime = timeit.default_timer()
found_1 = find_overlapped_instances(regex, sequence)
total_time += timeit.default_timer()-starttime
if(found_1):
num_hits += 1
print(num_hits)
print(total_time)
Formatting seems to affect the performance in search (which is the main and nearly only source of computation) in a good way which somehow makes a ~25% performance difference in general:
Searching the pattern in n strings | re.compile() | r'^{}$'.format() |
---|---|---|
n = 200 | 64.3 seconds | 47.1 seconds |
n = 400 | 136.7 seconds | 101.7 seconds |
Can someone help me understand why this might be happening?
Apart from the fact that the test should use the same regex (i.e. both with or without the ^
and $
anchors), the main problem is that when you have the compiled regular expression object, you should not pass it as first argument to re.match
, but call match
on that object. Only then you will benefit from that compilation.
Here is a simpler test that shows this difference, using three scenarios:
match
on the compiled regular expression object (fastest)re.match
with the pattern string as argument (medium)re.match
with the compiled regular expression object as argument (worst)Here is a script for testing that:
import re
import timeit
pattern = r'F.{4}D.AX'
compiled = re.compile(pattern)
s = "qsdf F9999D9AX los F999D99AX"
def test1():
compiled.match(s)
def test2():
re.match(pattern, s)
def test3():
re.match(compiled, s)
for test in (test1, test2, test3):
print(timeit.timeit(test, number=1000000))
Here is the output I got:
0.7618284979998862
2.8180372250008077
4.692441613999108