I have Python scripts to filter massive data in a csv file. The requirement asks me to consider scalability with respect to run time and memory usage.
I wrote 2 scripts, and both of them filter the data correctly. With scalability in mind, I decided to use a Python generator, because it works on an iterator and doesn't keep much data in memory.
When I compared the running time of the 2 scripts, I found the following:
Script 1 - uses a generator - takes more time - 0.0155925750732s
    import re
    import sympy

    def each_sentence(text):
        match = re.match(r'[0-9]+', text)
        num = int(text[match.start():match.end()])
        if sympy.isprime(num) == False:
            yield text.strip()

    with open("./file_testing.csv") as csvfile:
        for line in csvfile:
            for text in each_sentence(line):
                print(text)
Script 2 - uses split and no generator - takes less time - 0.00619888305664s
    import sympy

    with open("./file_testing.csv") as csvfile:
        for line in csvfile:
            array = line.split(',')
            num = int(array[0])
            if sympy.isprime(num) == False:
                print(line.strip())
To meet the requirement, do I need to use a Python generator? Do you have any suggestions or recommendations?
To meet the requirement, do I need to use a Python generator?
No, you don't. Script 1 doesn't make sense as written: the generator is created once per line and yields at most one result, so it never holds more than a single item anyway. A generator only saves memory when it replaces code that would otherwise build up a large list.
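If you do want the memory benefit of a generator, the filtering itself has to live inside the generator, so that lines stream through it one at a time. A minimal sketch of that shape (non_prime_lines is an illustrative name, and the regex is guarded against lines that don't start with a number):

    import re
    import sympy

    def non_prime_lines(lines):
        """Lazily yield stripped lines whose leading integer is not prime."""
        for line in lines:
            match = re.match(r'[0-9]+', line)
            if match and not sympy.isprime(int(match.group())):
                yield line.strip()

    with open("./file_testing.csv") as csvfile:
        for text in non_prime_lines(csvfile):
            print(text)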
Any suggestions or recommendations?
You need to learn about three things: complexity, parallelization and caching.
Complexity basically means: if I double the size of the input data (the csv file), do I need twice the time? Four times? Or something else entirely?
Parallelization means attacking a problem in a way that makes it easy to add more resources for solving it.
Caching is important: things get much faster if you don't have to re-create everything all the time but can re-use what you have already computed.
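For example, if the same leading numbers occur many times in your file, memoizing the primality test means each distinct number is only checked once. A small sketch using functools.lru_cache from the standard library (is_prime_cached is my name, not part of your code):

    import functools
    import sympy

    @functools.lru_cache(maxsize=None)
    def is_prime_cached(num):
        """Cache primality results so repeated numbers are only tested once."""
        return sympy.isprime(num)

You would then call is_prime_cached(num) instead of sympy.isprime(num) in the loop; this only pays off if the input actually contains duplicate numbers.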
The main loop, for line in csvfile:, already scales very well, because it streams the file one line at a time instead of loading it all into memory; it only becomes a problem if the csv file contains extremely long lines.
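One way to check that scaling empirically is to time the same filter on inputs of different sizes; a rough sketch (the second, larger file name is a placeholder you would have to create, e.g. by concatenating the file with itself):

    import time
    import sympy

    def run_filter(path):
        """Time one full pass of the script 2 filter over the given file."""
        start = time.time()
        with open(path) as csvfile:
            for line in csvfile:
                num = int(line.split(',')[0])
                if not sympy.isprime(num):
                    line.strip()  # stand-in for the real output step
        return time.time() - start

    # A roughly doubled time for a doubled input indicates linear scaling.
    print(run_filter("file_testing.csv"))
    print(run_filter("file_testing_doubled.csv"))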
Script 2 contains a bug: if the first cell in a line is not an integer, then int(array[0]) will raise a ValueError.
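One way to guard against that while keeping the rest of script 2 unchanged is to skip lines whose first cell doesn't parse as an integer:

    import sympy

    with open("./file_testing.csv") as csvfile:
        for line in csvfile:
            try:
                num = int(line.split(',')[0])
            except ValueError:
                continue  # first cell is not an integer; skip the line
            if not sympy.isprime(num):
                print(line.strip())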
The isprime function is probably the hotspot in your code, so you can try to parallelize it with multiple threads or sub-processes. Note that for CPU-bound pure-Python work like this, CPython's GIL means sub-processes will usually help where threads will not.
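A rough sketch using multiprocessing.Pool from the standard library (keep_line is my name, and the chunksize value is a guess you would have to tune):

    import multiprocessing
    import sympy

    def keep_line(line):
        """Return the stripped line if its first cell is a non-prime integer."""
        try:
            num = int(line.split(',')[0])
        except ValueError:
            return None
        return line.strip() if not sympy.isprime(num) else None

    if __name__ == "__main__":
        with open("./file_testing.csv") as csvfile, multiprocessing.Pool() as pool:
            # imap streams results back in order instead of building a list
            for result in pool.imap(keep_line, csvfile, chunksize=1000):
                if result is not None:
                    print(result)

Whether this actually wins depends on how expensive isprime is relative to the cost of shipping lines between processes, so measure before committing to it.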