python, pandas, performance, memory-management, large-data

How can I efficiently handle filtering and processing large datasets in Python with limited memory?


I'm working with a large dataset (around 1 million records) represented as a list of dictionaries in Python. Each dictionary has multiple fields, and I need to filter the data based on several conditions, then process the filtered results. The main challenge is that the dataset is too large to fit into memory all at once, and I need an efficient solution to both filter and process the data in a memory-conscious manner.

Here’s a simplified version of what I’m trying to achieve:

Filter records where age > 25 and status == 'active'. For the filtered records, extract certain fields, such as name and email, and process them (e.g., convert names to lowercase, extract domain from emails).

# Sample dataset
data = [
    {'name': 'Alice', 'age': 30, 'status': 'active', 'email': 'alice@example.com'},
    {'name': 'Bob', 'age': 22, 'status': 'inactive', 'email': 'bob@example.com'},
    {'name': 'Charlie', 'age': 35, 'status': 'active', 'email': 'charlie@example.com'},
    # More records...
]

# Attempted approach
def process_record(record):
    # Process the record, e.g., lowercase name, extract email domain
    record['name'] = record['name'].lower()
    record['email_domain'] = record['email'].split('@')[1]
    return record

filtered_and_processed = []
for record in data:
    if record['age'] > 25 and record['status'] == 'active':
        processed_record = process_record(record)
        filtered_and_processed.append(processed_record)

# Output the results
print(filtered_and_processed)


Solution

  • The best way to handle this is to iterate over the records as a stream instead of aggregating them all in memory. There are a few ways you can do this.

    Approach 1: Record-by-Record Stream

    You can accomplish this with the csv module by iterating directly over the file handle and writing each processed row immediately:

    import csv
    
    with open('yourfile.csv', newline='') as infile, open('outfile.csv', 'w', newline='') as outfile:
        reader = csv.DictReader(infile)
        # DictWriter requires explicit fieldnames; include the new email_domain column
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames + ['email_domain'])
        writer.writeheader()
    
        # Iterate directly over the reader
        for row in reader:
            # csv yields strings, so cast age before comparing
            if int(row['age']) <= 25 or row['status'] != 'active':
                continue
    
            # Write each row as it is processed, don't aggregate
            writer.writerow(process_record(row))
    

    This will handle all of your records in a stream and is very memory efficient.
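
    If your records are already arriving as an iterable of dicts (for example from a database cursor or a JSON-lines file) rather than a CSV, the same streaming idea can be written with generators, which are lazy and only hold one record in memory at a time. Here is a minimal sketch; the iter_records() source and the 'yourfile.jsonl' path are placeholders for wherever your dicts actually come from:

    import json
    
    def iter_records():
        # Placeholder source: yield one dict at a time, e.g. from a
        # JSON-lines file, a database cursor, or a paginated API.
        with open('yourfile.jsonl') as f:
            for line in f:
                yield json.loads(line)
    
    # Generator expressions are lazy: nothing is materialized up front
    filtered = (r for r in iter_records()
                if r['age'] > 25 and r['status'] == 'active')
    processed = (process_record(r) for r in filtered)
    
    # Consume the stream one record at a time
    for record in processed:
        print(record['name'], record['email_domain'])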

    Approach 2: Chunked Reading from a CSV with pandas

    This uses the pandas library to read a fixed number of records per iteration (100,000 in the example below, so a ~1M-row file is handled in roughly ten chunks rather than all at once). Chunking also lets you vectorize the processing; the loop below keeps your row-wise process_record via apply, and a vectorized sketch follows it.

    import pandas as pd
    
    first_chunk = True
    with pd.read_csv('yourfile.csv', chunksize=10**5) as reader:
        # Iterate over the chunks
        for chunk in reader:
            # Do the filtering first, then apply your row-wise function
            filtered = chunk[(chunk['age'] > 25) & (chunk['status'] == 'active')]
            df = filtered.apply(process_record, axis=1)
            # Append to outfile.csv, writing the header only with the first chunk
            df.to_csv('outfile.csv', mode='a', index=False, header=first_chunk)
            first_chunk = False
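
    Since the rest of the processing is per-column, you can also replace the row-wise apply with pandas' vectorized string methods, which is usually much faster on large chunks. A rough sketch of what the body of the loop above could look like instead (it reuses the chunk and first_chunk variables from that loop and assumes your real file has the same columns as the sample data):

    # Vectorized alternative to filtered.apply(process_record, axis=1)
    filtered = chunk[(chunk['age'] > 25) & (chunk['status'] == 'active')].copy()
    filtered['name'] = filtered['name'].str.lower()
    filtered['email_domain'] = filtered['email'].str.split('@').str[1]
    filtered.to_csv('outfile.csv', mode='a', index=False, header=first_chunk)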