python, pandas, csv, data-processing

How to Optimize Memory Usage When Processing Large CSV Files in Python?


I am working on a Python script to process large CSV files (ranging from 2GB to 10GB) and am encountering significant memory usage issues. The script reads a CSV file, performs various transformations on the data, and then writes the transformed data to a new CSV file. The transformations include filtering rows based on certain criteria, mapping values to new formats, and aggregating data from multiple rows.

To handle the CSV files, I initially used the pandas library due to its powerful data manipulation features. Here's a simplified version of my code:

import pandas as pd

some_value = 100  # placeholder threshold for the filter criterion

def process_csv(input_file, output_file):
    # Loads the entire file into memory at once
    df = pd.read_csv(input_file)
    filtered_df = df[df['column_name'] > some_value]
    filtered_df.to_csv(output_file, index=False)

process_csv('large_input.csv', 'transformed_output.csv')

Although this approach works well for smaller files, it results in excessive memory usage for larger files, causing my script to crash on machines with limited memory.

Expected vs. Actual Results: I expected pandas to handle large files efficiently through its various optimization options (such as chunking). However, even after trying to process the file in chunks, I am still running into out-of-memory errors.

Question: How can I optimize my Python script to reduce memory usage when processing large CSV files? Are there any best practices for using pandas with large datasets, or should I consider alternative libraries or techniques specifically suited for this scale of data processing?

Specific Challenges:

  • How to efficiently filter and transform data without loading the entire file into memory.
  • Best practices for writing the transformed data to a new CSV file in a memory-efficient manner.


Solution

  • If your dataset has any categorical variables, you can use the category dtype provided by pandas. Pass a mapping to the dtype parameter of read_csv() to reduce memory usage; the savings scale with how many low-cardinality object columns you convert, as the example below shows.

    # Without category dtypes
    df = pd.read_csv("dataframe.csv")
    df.info(memory_usage="deep")
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1615940 entries, 0 to 1615939
    Data columns (total 23 columns):
     #   Column            Non-Null Count    Dtype  
    ---  ------            --------------    -----  
     0   Job Id            1615940 non-null  int64  
     1   Experience        1615940 non-null  object 
     2   Qualifications    1615940 non-null  object 
     3   Salary Range      1615940 non-null  object 
     4   location          1615940 non-null  object 
     5   Country           1615940 non-null  object 
     6   latitude          1615940 non-null  float64
     7   longitude         1615940 non-null  float64
     8   Work Type         1615940 non-null  object 
     9   Company Size      1615940 non-null  int64  
     10  Job Posting Date  1615940 non-null  object 
     11  Preference        1615940 non-null  object 
     12  Contact Person    1615940 non-null  object 
     13  Contact           1615940 non-null  object 
     14  Job Title         1615940 non-null  object 
     15  Role              1615940 non-null  object 
     16  Job Portal        1615940 non-null  object 
     17  Job Description   1615940 non-null  object 
     18  Benefits          1615940 non-null  object 
     19  skills            1615940 non-null  object 
     20  Responsibilities  1615940 non-null  object 
     21  Company           1615940 non-null  object 
     22  Company Profile   1610462 non-null  object 
    dtypes: float64(2), int64(2), object(19)
    memory usage: 3.2 GB
    
    # With category dtypes
    cats = {"Work Type": "category", "Job Portal": "category"}
    df = pd.read_csv("dataframe.csv", dtype=cats)
    df.info(memory_usage="deep")
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1615940 entries, 0 to 1615939
    Data columns (total 23 columns):
     #   Column            Non-Null Count    Dtype   
    ---  ------            --------------    -----   
     0   Job Id            1615940 non-null  int64   
     1   Experience        1615940 non-null  object  
     2   Qualifications    1615940 non-null  object  
     3   Salary Range      1615940 non-null  object  
     4   location          1615940 non-null  object  
     5   Country           1615940 non-null  object  
     6   latitude          1615940 non-null  float64 
     7   longitude         1615940 non-null  float64 
     8   Work Type         1615940 non-null  category
     9   Company Size      1615940 non-null  int64   
     10  Job Posting Date  1615940 non-null  object  
     11  Preference        1615940 non-null  object  
     12  Contact Person    1615940 non-null  object  
     13  Contact           1615940 non-null  object  
     14  Job Title         1615940 non-null  object  
     15  Role              1615940 non-null  object  
     16  Job Portal        1615940 non-null  category
     17  Job Description   1615940 non-null  object  
     18  Benefits          1615940 non-null  object  
     19  skills            1615940 non-null  object  
     20  Responsibilities  1615940 non-null  object  
     21  Company           1615940 non-null  object  
     22  Company Profile   1610462 non-null  object  
    dtypes: category(2), float64(2), int64(2), object(17)
    memory usage: 3.0 GB
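
    Converting just those two columns saved about 0.2 GB; the more low-cardinality object columns you convert, the larger the savings. Here is a hedged sketch for finding further candidate columns from a sample of the file (the sample size and uniqueness threshold below are arbitrary assumptions, not pandas defaults):

    import pandas as pd

    def find_category_candidates(csv_path, nrows=100_000, max_unique_ratio=0.05):
        # Read only a sample of rows to keep this scan cheap
        sample = pd.read_csv(csv_path, nrows=nrows)
        candidates = {}
        for col in sample.select_dtypes(include="object").columns:
            # Columns with few distinct values repay category encoding
            if sample[col].nunique(dropna=True) / len(sample) <= max_unique_ratio:
                candidates[col] = "category"
        return candidates

    cats = find_category_candidates("dataframe.csv")
    df = pd.read_csv("dataframe.csv", dtype=cats)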
    

    In addition, you can pass the chunksize parameter to read_csv(), just as the dtype parameter was used above, so the file is read in smaller pieces instead of all at once; a sketch follows below. You can find more information about read_csv() in the official pandas documentation: pd.read_csv
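
    A minimal sketch of chunked processing, reusing the filter from the question (the column name and threshold are placeholders carried over from the question's code, and chunksize is a tunable assumption):

    import pandas as pd

    SOME_VALUE = 100  # placeholder threshold from the question's filter

    def process_csv_in_chunks(input_file, output_file, chunksize=100_000):
        first = True
        # Only one chunk is held in memory at a time
        for chunk in pd.read_csv(input_file, chunksize=chunksize):
            filtered = chunk[chunk['column_name'] > SOME_VALUE]
            # Write the header once, then append the remaining chunks
            filtered.to_csv(output_file, mode='w' if first else 'a',
                            header=first, index=False)
            first = False

    process_csv_in_chunks('large_input.csv', 'transformed_output.csv')

    The dtype mapping from above can be passed in the same read_csv() call as chunksize, so both optimizations apply at once.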