I'm working with a large dataset (around 1 million records) represented as a list of dictionaries in Python. Each dictionary has multiple fields, and I need to filter the data based on several conditions, then process the filtered results. The main challenge is that the dataset is too large to fit into memory all at once, and I need an efficient solution to both filter and process the data in a memory-conscious manner.
Here’s a simplified version of what I’m trying to achieve:
Filter records where age > 25 and status == 'active'. For the filtered records, extract certain fields, such as name and email, and process them (e.g., convert names to lowercase, extract domain from emails).
# Sample dataset
data = [
    {'name': 'Alice', 'age': 30, 'status': 'active', 'email': 'alice@example.com'},
    {'name': 'Bob', 'age': 22, 'status': 'inactive', 'email': 'bob@example.com'},
    {'name': 'Charlie', 'age': 35, 'status': 'active', 'email': 'charlie@example.com'},
    # More records...
]
# Attempted approach
def process_record(record):
    # Process the record, e.g., lowercase name, extract email domain
    record['name'] = record['name'].lower()
    record['email_domain'] = record['email'].split('@')[1]
    return record

filtered_and_processed = []
for record in data:
    if record['age'] > 25 and record['status'] == 'active':
        processed_record = process_record(record)
        filtered_and_processed.append(processed_record)

# Output the results
print(filtered_and_processed)
The best way to handle this is to iterate over a stream of records rather than aggregating them all in memory. There are a few ways to do this.
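If your records arrive from some iterable source (a database cursor, a JSON-lines parser, an API pager), a plain generator keeps only one record alive at a time. Here is a minimal sketch reusing your process_record and filter conditions; record_stream() and handle() are hypothetical placeholders for whatever your real source and sink are:

def filter_and_process(records):
    # Lazily yield only the matching, processed records
    for record in records:
        if record['age'] > 25 and record['status'] == 'active':
            yield process_record(record)

# record_stream() and handle() are stand-ins for your actual source and sink
for result in filter_and_process(record_stream()):
    handle(result)  # e.g. write to a file or database instead of appending to a list

Because filter_and_process is a generator, nothing is accumulated: each record is filtered, processed, and handed off before the next one is read.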
If the data lives in a CSV file, you can accomplish this with the csv module by iterating directly over the file handle:
import csv

with open('yourfile.csv', newline='') as infile, open('outfile.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    # DictWriter needs the output field names, including the new email_domain column
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames + ['email_domain'])
    writer.writeheader()
    # Iterate directly over the reader
    for row in reader:
        # csv gives you strings, so convert age before comparing
        if int(row['age']) <= 25 or row['status'] != 'active':
            continue
        # Write each matching row as you go, don't aggregate
        writer.writerow(process_record(row))
This will handle all of your records in a stream and is very memory efficient.
pandas batching from a CSV file: This uses the pandas library to process a fixed number of rows per iterative step (100k in the example below). Chunking could also let you vectorize your processing, but I have kept your process_record approach in the main example; a vectorized sketch follows the code.
import pandas as pd

first_chunk = True
with pd.read_csv('yourfile.csv', chunksize=10**5) as reader:
    # Iterate over the chunks
    for chunk in reader:
        # Do the filtering first, then apply your function row by row
        df = chunk[(chunk['age'] > 25) & (chunk['status'] == 'active')].apply(process_record, axis=1)
        # Append to outfile.csv, writing the header only once
        df.to_csv('outfile.csv', mode='a', index=False, header=first_chunk)
        first_chunk = False
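If you do want the vectorized route, here is a rough sketch of process_record rewritten as column operations instead of a per-row apply, assuming the same column names as your sample data:

import pandas as pd

first_chunk = True
with pd.read_csv('yourfile.csv', chunksize=10**5) as reader:
    for chunk in reader:
        df = chunk[(chunk['age'] > 25) & (chunk['status'] == 'active')].copy()
        # Vectorized equivalents of the per-record processing
        df['name'] = df['name'].str.lower()
        df['email_domain'] = df['email'].str.split('@').str[1]
        df.to_csv('outfile.csv', mode='a', index=False, header=first_chunk)
        first_chunk = False

The column-wise string operations avoid the Python-level loop that apply runs per row, which usually matters once chunks reach hundreds of thousands of rows.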