I’m working on a problem where I have to process a large set of sales transactions stored in a CSV file and summarize the results. The code runs slower than expected, and the execution time grows quickly as the dataset gets larger. I am using pandas to load and process the data. Are there any optimizations I can make to reduce the computation time and get the output faster? Here is the code I am using:
import pandas as pd
import numpy as np
# Sample dataset
n = 10**6 # million rows
np.random.seed(0)
transaction_ids = np.arange(1, n+1)
customer_ids = np.random.randint(100, 200, n)
sale_amounts = np.random.uniform(50, 500, n)
transaction_dates = pd.date_range('2023-01-01', periods=n, freq='T')
# DataFrame
df = pd.DataFrame({
    'transaction_id': transaction_ids,
    'customer_id': customer_ids,
    'sale_amount': sale_amounts,
    'transaction_date': transaction_dates
})
# Categorization function
def categorize_transaction(sale_amount):
    if sale_amount > 400:
        return 'High Value'
    elif sale_amount > 200:
        return 'Medium Value'
    else:
        return 'Low Value'
category_map = {
    'High Value': (df['sale_amount'] > 400),
    'Medium Value': (df['sale_amount'] > 200) & (df['sale_amount'] <= 400),
    'Low Value': (df['sale_amount'] <= 200)
}
df['category'] = np.select(
    [category_map['High Value'], category_map['Medium Value'], category_map['Low Value']],
    ['High Value', 'Medium Value', 'Low Value'],
    default='Unknown'
)
# Aggregation
category_summary = df.groupby('category')['sale_amount'].agg(
    total_sales='sum',
    avg_sales='mean',
    transaction_count='count'
).reset_index()
# Additional summary using 'transaction_date' for time-based grouping
df['transaction_month'] = df['transaction_date'].dt.to_period('M')
monthly_summary = df.groupby(['transaction_month', 'category'])['sale_amount'].agg(
    total_sales='sum',
    avg_sales='mean',
    transaction_count='count'
).reset_index()
print(category_summary.head())
print(monthly_summary.head())
First of all, the df['category'] = np.select(...) line is slow because of the implicit conversion of all the strings into an array of Python string objects. You can speed this up significantly by creating a categorical column rather than a string-based one, since operations on string objects are inherently slow:
df['category'] = pd.Categorical.from_codes(np.select(
    [category_map['High Value'], category_map['Medium Value'], category_map['Low Value']],
    [0, 1, 2],
    default=3
), ['High Value', 'Medium Value', 'Low Value', 'Unknown'])
This creates a categorical column with 4 possible values (integer codes mapped to predefined strings). It is about 8 times faster on my machine.
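As a side note, pd.cut can build an equivalent categorical column directly from the numeric thresholds. I have not benchmarked it against the snippet above, so treat it as an alternative sketch rather than a guaranteed win:
# Alternative sketch: pd.cut returns a categorical column directly.
# Bins are right-closed by default, so (-inf, 200] -> Low, (200, 400] -> Medium
# and (400, inf) -> High, which matches the original thresholds.
df['category'] = pd.cut(
    df['sale_amount'],
    bins=[-np.inf, 200, 400, np.inf],
    labels=['Low Value', 'Medium Value', 'High Value']
)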
With a categorical column in place, the aggregation also runs much faster (about 5 times) because pandas operates on integers rather than slow string objects. It also speeds up the very last operation (roughly twice as fast).
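One detail to be aware of with a categorical key: depending on your pandas version, groupby may keep unobserved categories (here 'Unknown') in the result by default; passing observed=True drops them so the summary matches the string-based output:
# Grouping on the categorical column; observed=True skips the unused
# 'Unknown' category so the summary stays identical to the original one.
category_summary = df.groupby('category', observed=True)['sale_amount'].agg(
    total_sales='sum',
    avg_sales='mean',
    transaction_count='count'
).reset_index()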
The df['transaction_date'].dt.to_period('M') line is particularly slow. Directly using NumPy (with .astype('datetime64[M]')) does not make it faster. Since this operation is compute-bound, you can parallelize it. Alternatively, you can write your own (parallel) implementation with Numba (or Cython), though this is tedious to write since one needs to take care of leap years (and possibly even leap seconds); a sketch of this idea is shown below.
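Here is a minimal Numba sketch of that idea, assuming the column holds datetime64[ns] values with no NaT: it converts each timestamp to a month ordinal (months since 1970-01) in parallel, using the standard days-to-civil-date calendar algorithm so leap years are handled; leap seconds are ignored, and the actual speedup depends on your core count.
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def month_codes(ns):
    # ns: int64 nanoseconds since the Unix epoch (the raw datetime64[ns] values)
    out = np.empty(ns.size, dtype=np.int64)
    for i in nb.prange(ns.size):
        days = ns[i] // 86_400_000_000_000      # nanoseconds per day
        z = days + 719_468                      # shift epoch to 0000-03-01
        era = z // 146_097                      # 400-year eras
        doe = z - era * 146_097                 # day of era, in [0, 146096]
        yoe = (doe - doe // 1460 + doe // 36_524 - doe // 146_096) // 365
        y = yoe + era * 400
        doy = doe - (365 * yoe + yoe // 4 - yoe // 100)
        mp = (5 * doy + 2) // 153               # March-based month, in [0, 11]
        m = mp + 3 if mp < 10 else mp - 9       # calendar month, in [1, 12]
        if m <= 2:
            y += 1
        out[i] = (y - 1970) * 12 + (m - 1)      # months since 1970-01
    return out

# Integer month key, usable directly as a groupby key
df['transaction_month'] = month_codes(df['transaction_date'].values.view('int64'))
The integer codes group exactly like monthly periods; if you need real Period labels afterwards, recent pandas versions can rebuild them from these ordinals (pd.PeriodIndex.from_ordinals), but for the summary an integer key is enough.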
Update: You can make the first snippet even faster thanks to 8-bit integers (assuming there are fewer than 128 categories). This can be done by replacing [0, 1, 2] with np.array([0, 1, 2], dtype=np.int8). This is about 35% faster than the default 32-bit codes.
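Putting that together, the categorization looks like this (the default is also given as an np.int8 here to keep the whole code array 8-bit; that extra cast is an addition, not something measured above):
# 8-bit codes for the categorical column (valid as long as there are
# fewer than 128 categories).
df['category'] = pd.Categorical.from_codes(np.select(
    [category_map['High Value'], category_map['Medium Value'], category_map['Low Value']],
    np.array([0, 1, 2], dtype=np.int8),
    default=np.int8(3)
), ['High Value', 'Medium Value', 'Low Value', 'Unknown'])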