Tags: python, optimization, faker

Optimizing Python loops for millions of rows


I'm trying to simulate test datasets using Python's Faker library. The goal is to have a few million records for my use case. Below is the code I use to populate 5 data elements for 1 million records (the loop runs 500,000 times and appends two rows per iteration).

import pandas as pd
from faker import Faker

fake = Faker()
df = pd.DataFrame(columns=['COL1', 'COL2', 'COL3', 'COL4', 'COL5'])

for i in range(500000):
    # append one female and one male record per iteration (1 million rows total)
    df = df.append(
        {'COL1': fake.first_name_female(),
         'COL2': fake.last_name_female(),
         'COL3': 'F',
         'COL4': fake.street_address(),
         'COL5': fake.zipcode_in_state()
         }, ignore_index=True)
    df = df.append(
        {'COL1': fake.first_name_male(),
         'COL2': fake.last_name_male(),
         'COL3': 'M',
         'COL4': fake.street_address(),
         'COL5': fake.zipcode_in_state()
         }, ignore_index=True)

It took nearly 8 hours to run this. How could I optimize this loop to run faster?


Solution

  • A conventional way of implementing this is to build all the rows in a plain Python list and construct the DataFrame once at the end. Calling df.append inside the loop copies the entire DataFrame on every iteration, which is what makes the original version so slow; note also that DataFrame.append was removed in pandas 2.0, so a pattern like the one below (or pd.concat) is required there anyway. A chunked variant for multi-million-row output is sketched after the timing.

    import pandas as pd
    from time import time
    from faker import Faker
    fake = Faker()
    
    def fake_row(i):
        # alternate female/male rows based on the index
        if i % 2 == 0:
            row = [fake.first_name_female(), fake.last_name_female(), 'F', fake.street_address(), fake.zipcode_in_state()]
        else:
            row = [fake.first_name_male(), fake.last_name_male(), 'M', fake.street_address(), fake.zipcode_in_state()]
        return row
    
    start = time()
    # build all rows in a plain list, then construct the DataFrame once
    fake_data = [fake_row(i) for i in range(500000)]
    df = pd.DataFrame(fake_data, columns=['COL1', 'COL2', 'COL3', 'COL4', 'COL5'])
    print('[TIME]', time() - start)
    [TIME] 171.82 secs
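
    If the end goal is "a few million records", holding everything in one in-memory list may also become a concern. One possible variation, sketched below, is to generate the rows in chunks and append each chunk to a CSV file as it is produced; the file name fake_data.csv and the chunk size of 100,000 are arbitrary choices for illustration, and only the pandas and Faker calls already used above are involved.

    import pandas as pd
    from faker import Faker
    fake = Faker()
    
    COLUMNS = ['COL1', 'COL2', 'COL3', 'COL4', 'COL5']
    
    def fake_row(i):
        if i % 2 == 0:
            return [fake.first_name_female(), fake.last_name_female(), 'F', fake.street_address(), fake.zipcode_in_state()]
        return [fake.first_name_male(), fake.last_name_male(), 'M', fake.street_address(), fake.zipcode_in_state()]
    
    def generate_csv(total_rows, chunk_size=100000, path='fake_data.csv'):
        # build the rows one chunk at a time so memory use is bounded by chunk_size
        for start in range(0, total_rows, chunk_size):
            stop = min(start + chunk_size, total_rows)
            rows = [fake_row(i) for i in range(start, stop)]
            chunk = pd.DataFrame(rows, columns=COLUMNS)
            # append to the CSV; write the header only for the first chunk
            chunk.to_csv(path, mode='a', index=False, header=(start == 0))
    
    generate_csv(1000000)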
    

    To improve efficiency further, you can parallelize the row generation. The deco library provides a convenient way of running the function concurrently across multiple processes with just two decorators. (If you prefer to stay within the standard library, a concurrent.futures sketch follows the deco example below.)

    import pandas as pd
    from time import time
    from faker import Faker
    from deco import concurrent, synchronized
    fake = Faker()
    
    @concurrent   # run each call to fake_row in a separate process
    def fake_row(i):
        if i % 2 == 0:
            row = [fake.first_name_female(), fake.last_name_female(), 'F', fake.street_address(), fake.zipcode_in_state()]
        else:
            row = [fake.first_name_male(), fake.last_name_male(), 'M', fake.street_address(), fake.zipcode_in_state()]
        return row
    
    @synchronized   # wait for all concurrent calls to finish before their results are used
    def run(size):
        res = []
        for i in range(size):
            res.append(fake_row(i))
        return pd.DataFrame(res, columns=['COL1', 'COL2', 'COL3', 'COL4', 'COL5'])
    
    start = time()
    df = run(500000)
    print('[TIME]', time() - start)
    [TIME] 88.11 secs
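
    If you would rather avoid an extra dependency, a similar parallel layout can be expressed with the standard library's concurrent.futures. The sketch below is an assumed equivalent of the deco version, not a measured one: it spreads the fake_row calls across a ProcessPoolExecutor, and the chunksize of 1000 is an arbitrary batching choice; the actual speed-up will depend on your machine.

    import pandas as pd
    from time import time
    from faker import Faker
    from concurrent.futures import ProcessPoolExecutor
    fake = Faker()
    
    def fake_row(i):
        if i % 2 == 0:
            return [fake.first_name_female(), fake.last_name_female(), 'F', fake.street_address(), fake.zipcode_in_state()]
        return [fake.first_name_male(), fake.last_name_male(), 'M', fake.street_address(), fake.zipcode_in_state()]
    
    if __name__ == '__main__':
        start = time()
        # each worker process re-imports this module and so gets its own Faker instance;
        # chunksize batches the calls to cut down on inter-process overhead
        with ProcessPoolExecutor() as executor:
            rows = list(executor.map(fake_row, range(500000), chunksize=1000))
        df = pd.DataFrame(rows, columns=['COL1', 'COL2', 'COL3', 'COL4', 'COL5'])
        print('[TIME]', time() - start)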