Tags: python, generator, yield-keyword

Python - scalability with respect to run time and memory usage is important


I have Python scripts to filter massive data in a CSV file. The requirement asks me to consider scalability with respect to run time and memory usage.

I wrote two scripts, and both of them filter the data correctly. To address scalability, I decided to use a Python generator, because it works as an iterator and does not keep much data in memory.

When I compared the running times of the two scripts, I found the following:

Script 1 - uses a generator - takes more time - 0.0155925750732 s

import re
import sympy

def each_sentence(text):
    # Extract the leading number and yield the line if that number is not prime
    match = re.match(r'[0-9]+', text)
    num = int(text[match.start():match.end()])
    if not sympy.isprime(num):
        yield text.strip()

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        for text in each_sentence(line):
            print(text)

Script 2 - uses split and no generator - takes less time - 0.00619888305664 s

import sympy

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        array = line.split(',')
        num = int(array[0])
        if not sympy.isprime(num):
            print(line.strip())

To meet the requirement, do I need to use a Python generator? Or do you have any suggestions or recommendations?


Solution

  • To meet the requirement, do I need to use a Python generator?

    No, you don't. Script 1 doesn't make sense: the generator is created for every line and yields at most one result on its first and only iteration, so it gives you none of the benefits of a real generator. A sketch of what a useful generator over the whole file could look like is at the end of this answer.

    Any suggestions or recommendations?

    You need to learn about three things: complexity, parallelization, and caching (small sketches for the last two follow at the end of this answer).

    The main loop for line in csvfile: already scales very well, because it reads one line at a time, unless the CSV file contains extremely long lines.

    Script 2 contains a bug: if the first cell in a line is not an integer, then int(array[0]) will raise a ValueError. A guarded version is sketched at the end of this answer.

    The isprime function is probably the "hotspot" in your code, so you can try to parallelize it with multiple threads or sub-processes; a multiprocessing sketch follows below.
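
    If you do want a generator, the idiomatic shape is a single generator that wraps the whole file and yields only the lines you want to keep. A minimal sketch under that assumption (the helper name non_prime_lines is mine, not from the question, and it also skips lines that do not start with a number):

import re
import sympy

def non_prime_lines(csvfile):
    # Yield every stripped line whose leading number is not prime;
    # only one line is held in memory at a time.
    for line in csvfile:
        match = re.match(r'[0-9]+', line)
        if match is None:
            continue  # line does not start with a number, skip it
        if not sympy.isprime(int(match.group())):
            yield line.strip()

with open("./file_testing.csv") as csvfile:
    for text in non_prime_lines(csvfile):
        print(text)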
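
    On caching: it only pays off if the same leading numbers occur more than once in the file. If they do, memoizing the primality test with functools.lru_cache is a small change; a sketch (the wrapper name is mine):

import sympy
from functools import lru_cache

@lru_cache(maxsize=None)
def is_prime_cached(num):
    # Repeated numbers hit the cache instead of re-running sympy.isprime
    return sympy.isprime(num)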
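
    The bug in script 2 can be handled with an explicit try/except around the conversion; a sketch that simply skips malformed lines:

import sympy

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        try:
            num = int(line.split(',')[0])
        except ValueError:
            continue  # first cell is not an integer, skip the line
        if not sympy.isprime(num):
            print(line.strip())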
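
    For the parallelization point, one option is to push the primality checks into worker processes with multiprocessing.Pool. A minimal sketch, assuming the lines can be processed independently (keep_line is a hypothetical helper; imap preserves the input order):

import multiprocessing
import sympy

def keep_line(line):
    # Return the stripped line if its first cell is a non-prime integer, else None
    try:
        num = int(line.split(',')[0])
    except ValueError:
        return None
    return None if sympy.isprime(num) else line.strip()

if __name__ == '__main__':
    with open("./file_testing.csv") as csvfile, multiprocessing.Pool() as pool:
        # A large chunksize keeps inter-process overhead low for many small lines
        for text in pool.imap(keep_line, csvfile, chunksize=1000):
            if text is not None:
                print(text)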