I have a dataset of 30 million rows with two columns: one contains a 1 or 0 label, and the other holds a list of 1280 features per row (181 GB total). All I want to do is plug this dataset into a Random Forest algorithm, but memory runs out and it crashes (I've tried with 400 GB of memory, and it still crashes).
After loading the dataset, I had to manipulate it a bit since it is in the Hugging Face arrow format: https://huggingface.co/docs/datasets/en/about_arrow (I suspect this is taking up a lot of RAM).
I am aware I could do some dimensionality reduction on my dataset, but are there any changes I should make to my code to reduce RAM usage?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, auc
from datasets import load_dataset, Dataset
# Load dataset
df = Dataset.from_file("data.arrow")
df = pd.DataFrame(df)
X = df['embeddings'].to_numpy() # Convert Series to NumPy array
X = np.array(X.tolist()) # Convert list of arrays to a 2D NumPy array
X = X.reshape(X.shape[0], -1) # Ensure a 2D (n_samples, n_features) shape
y = df['labels']
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the classifier
rf_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)
# Evaluate the classifier
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Calculate AUC score
# Use predicted probabilities, not the hard 0/1 predictions, for a meaningful AUC
y_pred_proba = rf_classifier.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score:", auc_score)
with open("metrics.txt", "w") as f:
    f.write("Accuracy: " + str(accuracy) + "\n")
    f.write("AUC Score: " + str(auc_score))
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
# Save ROC curve plot to an image file
plt.savefig('roc_curve.png')
# Close plot to free memory
plt.close()
Few ideas:

1. Train N models (you must choose N depending on RAM usage), each on a separate part of the train data. Then fuse the models: call the predict_proba(x) method of each model at inference time and average the predictions. This may have better/worse/the same accuracy as a single model; if N is not very large it should not have a big impact. A sketch is below.
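For example, a minimal sketch of idea 1, assuming X and y are NumPy arrays opened with np.load(..., mmap_mode='r') so they stay on disk; n_models and the per-chunk tree count are arbitrary choices here, not tuned values:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_chunked_forest(X, y, n_models=10, seed=42):
    # Split the row indices into n_models contiguous chunks; with a
    # memory-mapped X, fancy indexing copies only one chunk into RAM.
    models = []
    for i, idx in enumerate(np.array_split(np.arange(len(y)), n_models)):
        rf = RandomForestClassifier(n_estimators=10, random_state=seed + i, n_jobs=-1)
        rf.fit(X[idx], y[idx])
        models.append(rf)
    return models

def predict_proba_fused(models, X):
    # The "fusion": average the per-model class probabilities.
    return np.mean([m.predict_proba(X) for m in models], axis=0)

With n_models=10 you hold roughly 3 million rows at a time instead of 30 million, and the averaged ensemble should behave similarly to a single forest with the same total number of trees.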
2. Fork scikit-learn and substitute every loop over the x training data with a custom loop that loads the data from disk instead of from RAM. This is a hard or very hard, long approach, and I am not sure what problems you will face along the way. In terms of difficulty, the only thing worse would be writing the RF from scratch.
3. Lower the precision of your data: convert it to float32 (data.astype(np.float32)), or maybe even int16 if properly scaled+transformed? You could also average groups of similar samples into one row and train with sample_weight = number_of_averaged_samples. A sketch of the float32 route is below.
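A sketch of the float32 route, filling a preallocated array straight from the Arrow file so the float64 pandas/list intermediates from the question never exist; the column names 'embeddings' and 'labels' come from the question, and the batch size is an arbitrary assumption:

import numpy as np
from datasets import Dataset

ds = Dataset.from_file("data.arrow")  # Arrow files are memory-mapped, cheap to open
n_rows, n_feats = len(ds), 1280

# 30M x 1280 in float64 is ~307 GB; float32 halves that to ~154 GB.
# int16 (~77 GB) would additionally need the values scaled into its range.
X = np.empty((n_rows, n_feats), dtype=np.float32)
for start in range(0, n_rows, 10_000):
    X[start:start + 10_000] = ds[start:start + 10_000]["embeddings"]
y = np.asarray(ds["labels"], dtype=np.int8)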
4. Train a forest on a subset you can afford and check its feature_importances_ to find the features that contribute least. Then disregard them while loading the dataset next time. A sketch is below.
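And a sketch of the feature_importances_ idea; the 500k-row probe sample and the keep-the-top-256 cutoff are made-up numbers you would tune:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit a probe forest on a subsample that fits comfortably in RAM.
rng = np.random.default_rng(42)
idx = rng.choice(len(y), size=500_000, replace=False)
probe = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
probe.fit(X[idx], y[idx])

# Keep the indices of the 256 most important of the 1280 features.
keep = np.argsort(probe.feature_importances_)[-256:]
X_pruned = X[:, keep]  # 5x fewer columns to load and hold next time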