I have a dataset of 30 million rows with two columns: one contains a 1 or 0 label, and the other holds a list of 1280 features per row (181 GB total). All I want to do is plug this dataset into a Random Forest algorithm, but memory runs out and it crashes (I've tried with 400 GB of memory, and it still crashes).
After loading the dataset, I had to manipulate it a bit since it is in the Hugging Face arrow format: https://huggingface.co/docs/datasets/en/about_arrow (I suspect this is taking up a lot of RAM).
I am aware I could do some dimensionality reduction on my dataset, but are there any changes I should make to my code to reduce RAM usage?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, auc
from datasets import load_dataset, Dataset
# Load dataset
df = Dataset.from_file("data.arrow")
df = pd.DataFrame(df)
X = df['embeddings'].to_numpy() # Convert Series to NumPy array
X = np.array(X.tolist()) # Convert list of arrays to a 2D NumPy array
X = X.reshape(X.shape[0], -1) # Ensure a 2D (n_samples, n_features) shape
y = df['labels']
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the classifier
rf_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)
# Evaluate the classifier
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Calculate AUC score
# Use predicted probabilities, not the hard 0/1 predictions, for a meaningful AUC
y_pred_proba = rf_classifier.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score:", auc_score)
with open("metrics.txt", "w") as f:
    f.write("Accuracy: " + str(accuracy) + "\n")
    f.write("AUC Score: " + str(auc_score))
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
# Save ROC curve plot to an image file
plt.savefig('roc_curve.png')
# Close plot to free memory
plt.close()
Few ideas:

1. Train N models (you must choose N depending on RAM usage), each on a separate part of the train data. Then fuse the models: call the predict_proba(x) method of each model at inference time and average the predictions. This may have better/worse/the same accuracy as a single model; if N is not very large it should not have a big impact. A sketch is below.
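For example, a minimal sketch of idea 1, assuming X and y are NumPy arrays opened with np.load(..., mmap_mode='r') so they stay on disk; n_models and the per-chunk tree count are arbitrary choices here, not tuned values:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_chunked_forest(X, y, n_models=10, seed=42):
    # Split the row indices into n_models contiguous chunks; with a
    # memory-mapped X, fancy indexing copies only one chunk into RAM.
    models = []
    for i, idx in enumerate(np.array_split(np.arange(len(y)), n_models)):
        rf = RandomForestClassifier(n_estimators=10, random_state=seed + i, n_jobs=-1)
        rf.fit(X[idx], y[idx])
        models.append(rf)
    return models

def predict_proba_fused(models, X):
    # The "fusion": average the per-model class probabilities.
    return np.mean([m.predict_proba(X) for m in models], axis=0)

With n_models=10 you hold roughly 3 million rows at a time instead of 30 million, and the averaged ensemble should behave similarly to a single forest with the same total number of trees.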
2. Fork scikit-learn and substitute every loop over the x training data with a custom loop that loads the data from disk instead of from RAM. This is a hard or very hard, long approach, and I am not sure what problems you will face along the way. In terms of difficulty, the only thing worse would be writing the RF from scratch.
3. Lower the precision of your data: convert it to float32 (data.astype(np.float32)), or maybe even int16 if properly scaled+transformed? You could also average groups of similar samples into one row and train with sample_weight = number_of_averaged_samples. A sketch of the float32 route is below.
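A sketch of the float32 route, filling a preallocated array straight from the Arrow file so the float64 pandas/list intermediates from the question never exist; the column names 'embeddings' and 'labels' come from the question, and the batch size is an arbitrary assumption:

import numpy as np
from datasets import Dataset

ds = Dataset.from_file("data.arrow")  # Arrow files are memory-mapped, cheap to open
n_rows, n_feats = len(ds), 1280

# 30M x 1280 in float64 is ~307 GB; float32 halves that to ~154 GB.
# int16 (~77 GB) would additionally need the values scaled into its range.
X = np.empty((n_rows, n_feats), dtype=np.float32)
for start in range(0, n_rows, 10_000):
    X[start:start + 10_000] = ds[start:start + 10_000]["embeddings"]
y = np.asarray(ds["labels"], dtype=np.int8)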
4. Train a forest on a subset you can afford and check its feature_importances_ to find the features that contribute least. Then disregard them while loading the dataset next time. A sketch is below.
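And a sketch of the feature_importances_ idea; the 500k-row probe sample and the keep-the-top-256 cutoff are made-up numbers you would tune:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit a probe forest on a subsample that fits comfortably in RAM.
rng = np.random.default_rng(42)
idx = rng.choice(len(y), size=500_000, replace=False)
probe = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
probe.fit(X[idx], y[idx])

# Keep the indices of the 256 most important of the 1280 features.
keep = np.argsort(probe.feature_importances_)[-256:]
X_pruned = X[:, keep]  # 5x fewer columns to load and hold next time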