python, pandas, numpy, performance, scikit-learn

How to train a Random Forest classifier on a large dataset without running into memory errors in Python?


I have a dataset with 30 million rows and two columns: one contains a 0/1 label, and the other contains a list of 1,280 features for each row (181 GB in total). All I want to do is feed this dataset into a Random Forest algorithm, but it runs out of memory and crashes (I've tried a machine with 400 GB of RAM, and it still crashes).

After loading the dataset, I had to manipulate it a bit since it is in the Hugging Face Arrow format: https://huggingface.co/docs/datasets/en/about_arrow (I suspect this manipulation is taking up a lot of RAM).

I am aware I could apply some dimensionality reduction to my dataset, but are there any changes I should make to my code to reduce RAM usage?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, auc
from datasets import load_dataset, Dataset

# Load dataset
df = Dataset.from_file("data.arrow")  # memory-mapped Arrow file
df = df.to_pandas()  # materializes the whole dataset in RAM
X = df['embeddings'].to_numpy() # Convert Series to NumPy array
X = np.array(X.tolist()) # Convert list of arrays to a 2D NumPy array
X = X.reshape(X.shape[0], -1) # Ensure a 2D array (flattens any extra nesting)
y = df['labels']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the classifier

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate AUC score from predicted probabilities (not hard labels)
y_pred_proba = rf_classifier.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score:", auc_score)

with open("metrics.txt", "w") as f:
    f.write("Accuracy: " + str(accuracy) + "\n")
    f.write("AUC Score: " + str(auc_score))
    

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")

# Save ROC curve plot to an image file
plt.savefig('roc_curve.png')

# Close plot to free memory
plt.close()

Solution

  • A few ideas:

    Ensemble of models

    Train N models (you must choose N depending on your RAM budget), each on a separate part of the training data.

    Then fuse the models: at inference time, call each model's predict_proba(x) method and average the predicted probabilities (see the sketch below).

    This may give better, worse, or the same accuracy as a single model; as long as N is not very large, the impact should be small.
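    A minimal sketch of that idea, reusing X_train, y_train, and X_test from the question's code; N = 10 and the per-model settings are placeholders to tune against your RAM budget:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Split the training data into N disjoint chunks; only one chunk plus
    # one forest needs to fit in RAM at a time.
    N = 10
    chunks_X = np.array_split(X_train, N)
    chunks_y = np.array_split(y_train, N)

    models = []
    for X_chunk, y_chunk in zip(chunks_X, chunks_y):
        model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
        model.fit(X_chunk, y_chunk)
        models.append(model)

    # Fusion: average the predicted probabilities over the N models.
    # Assumes every chunk contains both classes, so the predict_proba
    # columns line up across models.
    avg_proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
    y_pred = avg_proba.argmax(axis=1)

    Note that the fused ensemble effectively has N × n_estimators trees, so you may want fewer trees per chunk to keep prediction time in check.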

    Fork of scikit-learn

    Fork scikit-learn and substitute every loop over the x input training data with a custom loop that loads the data from disk instead of from RAM.

    This is a hard or very hard, long approach, and I am not sure what problems you would face along the way. The only thing more difficult would be writing a Random Forest from scratch. A lighter-weight variant of the same idea is sketched below.
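    Short of forking, you can get part of the way there by keeping the feature matrix on disk yourself and letting the OS page it in on demand. This sketch assumes the features have already been exported to a raw float32 file ("features.f32" and the shape are placeholders); scikit-learn's input validation may still touch or copy the data, so treat this as reducing, not eliminating, memory pressure:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Memory-map the feature matrix: the array lives on disk and pages are
    # loaded on demand instead of holding ~150 GB of float32 in RAM.
    n_rows, n_features = 30_000_000, 1280  # placeholder shape
    X = np.memmap("features.f32", dtype=np.float32, mode="r",
                  shape=(n_rows, n_features))
    y = np.load("labels.npy")  # labels are small enough to keep in RAM

    # scikit-learn's trees use float32 internally, so float32 input should
    # avoid a dtype-conversion copy; expect heavy disk traffic during fit.
    clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    clf.fit(X, y)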

    Other ideas