I am working on a Python script to process large CSV files (ranging from 2GB to 10GB) and am encountering significant memory usage issues. The script reads a CSV file, performs various transformations on the data, and then writes the transformed data to a new CSV file. The transformations include filtering rows based on certain criteria, mapping values to new formats, and aggregating data from multiple rows.
To handle the CSV files, I initially used the pandas library due to its powerful data manipulation features. Here's a simplified version of my code:
import pandas as pd

def process_csv(input_file, output_file):
    df = pd.read_csv(input_file)
    filtered_df = df[df['column_name'] > some_value]
    filtered_df.to_csv(output_file, index=False)

process_csv('large_input.csv', 'transformed_output.csv')
Although this approach works well for smaller files, it results in excessive memory usage for larger files, causing my script to crash on machines with limited memory.
Expected vs. Actual Results: I expected pandas to be able to efficiently handle large files through its various optimization options (like chunking). However, even after trying to process the file in chunks, I'm still facing out-of-memory errors.
Question: How can I optimize my Python script to reduce memory usage when processing large CSV files? Are there any best practices for using pandas with large datasets, or should I consider alternative libraries or techniques specifically suited for this scale of data processing?
Specific Challenges: How can I efficiently filter and transform data without loading the entire file into memory? What are the best practices for writing the transformed data to a new CSV file in a memory-efficient manner?
If your dataset has any categorical variables, you can use the category dtype that pandas provides. Passing the dtype parameter to read_csv() reduces memory usage somewhat.
# Without categoric variables
df = pd.read_csv("dataframe.csv")
df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1615940 entries, 0 to 1615939
Data columns (total 23 columns):
 #   Column            Non-Null Count    Dtype
---  ------            --------------    -----
 0   Job Id            1615940 non-null  int64
 1   Experience        1615940 non-null  object
 2   Qualifications    1615940 non-null  object
 3   Salary Range      1615940 non-null  object
 4   location          1615940 non-null  object
 5   Country           1615940 non-null  object
 6   latitude          1615940 non-null  float64
 7   longitude         1615940 non-null  float64
 8   Work Type         1615940 non-null  object
 9   Company Size      1615940 non-null  int64
 10  Job Posting Date  1615940 non-null  object
 11  Preference        1615940 non-null  object
 12  Contact Person    1615940 non-null  object
 13  Contact           1615940 non-null  object
 14  Job Title         1615940 non-null  object
 15  Role              1615940 non-null  object
 16  Job Portal        1615940 non-null  object
 17  Job Description   1615940 non-null  object
 18  Benefits          1615940 non-null  object
 19  skills            1615940 non-null  object
 20  Responsibilities  1615940 non-null  object
 21  Company           1615940 non-null  object
 22  Company Profile   1610462 non-null  object
dtypes: float64(2), int64(2), object(19)
memory usage: 3.2 GB
# With categoric variables
cats = {"Work Type": "category", "Job Portal": "category"}
df = pd.read_csv("dataframe.csv", dtype=cats)
df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1615940 entries, 0 to 1615939
Data columns (total 23 columns):
 #   Column            Non-Null Count    Dtype
---  ------            --------------    -----
 0   Job Id            1615940 non-null  int64
 1   Experience        1615940 non-null  object
 2   Qualifications    1615940 non-null  object
 3   Salary Range      1615940 non-null  object
 4   location          1615940 non-null  object
 5   Country           1615940 non-null  object
 6   latitude          1615940 non-null  float64
 7   longitude         1615940 non-null  float64
 8   Work Type         1615940 non-null  category
 9   Company Size      1615940 non-null  int64
 10  Job Posting Date  1615940 non-null  object
 11  Preference        1615940 non-null  object
 12  Contact Person    1615940 non-null  object
 13  Contact           1615940 non-null  object
 14  Job Title         1615940 non-null  object
 15  Role              1615940 non-null  object
 16  Job Portal        1615940 non-null  category
 17  Job Description   1615940 non-null  object
 18  Benefits          1615940 non-null  object
 19  skills            1615940 non-null  object
 20  Responsibilities  1615940 non-null  object
 21  Company           1615940 non-null  object
 22  Company Profile   1610462 non-null  object
dtypes: category(2), float64(2), int64(2), object(17)
memory usage: 3.0 GB
In addition, you can use the chunksize parameter of read_csv(), in the same way the dtype parameter is used above, to process the file in smaller pieces instead of loading it all into memory at once. You can find more information in the official pandas documentation for pd.read_csv.
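As a rough illustration, here is a minimal sketch of chunked processing that filters each chunk and appends it to the output file. The column name 'column_name' and the threshold are placeholders carried over from the simplified code in the question, and the chunksize of 100,000 rows is only an example; tune it to your available memory.

import pandas as pd

def process_csv_in_chunks(input_file, output_file, threshold, chunksize=100_000):
    # read_csv with chunksize returns an iterator of DataFrames,
    # so only about chunksize rows are held in memory at any time
    first_chunk = True
    for chunk in pd.read_csv(input_file, chunksize=chunksize):
        # 'column_name' and 'threshold' stand in for the question's placeholders
        filtered = chunk[chunk['column_name'] > threshold]
        # write the header only for the first chunk, then append
        filtered.to_csv(output_file, mode='w' if first_chunk else 'a',
                        header=first_chunk, index=False)
        first_chunk = False

process_csv_in_chunks('large_input.csv', 'transformed_output.csv', threshold=0)  # threshold value is illustrative

You can also pass the same dtype mapping (dtype=cats) to read_csv() here, so the categorical conversion and the chunked reading work together.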