I am working with the pymfe meta-feature extractor package for complexity analysis. On a small dataset this is not a problem, for example:
pip install -U pymfe
from sklearn.datasets import make_classification
from sklearn.datasets import load_iris
from pymfe.mfe import MFE
data = load_iris()
X = data.data
y = data.target
extractor = MFE(features=["t1"], groups=["complexity"],
                summary=["min", "max", "mean", "sd"])
extractor.fit(X, y)
extractor.extract()
(['t1'], [0.12])
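As a side note, extract() returns two parallel lists (feature names and summarised values), which can be zipped into a dict for readability; this is plain Python, not a pymfe option:
names, values = extractor.extract()
print(dict(zip(names, values)))  # e.g. {'t1': 0.12}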
My dataset is large (32690, 80), and this computation gets killed for excessive memory usage. I work on Ubuntu 24.04 with 32 GB of RAM.
To reproduce the scenario:
# Generate the dataset
X, y = make_classification(n_samples=20_000, n_features=80,
                           n_informative=60, n_classes=5, random_state=42)
extractor = MFE(features=["t1"], groups=["complexity"],
                summary=["min", "max", "mean", "sd"])
extractor.fit(X, y)
extractor.extract()
Killed
Question:
How do I split this task so it runs on small partitions of the dataset, and then combine the final results (by averaging)?
I managed to find a workaround.
# helper functions
import numpy as np

def split_dataset(X, y, n_splits):
    # split the data into n_splits row-wise partitions
    split_X = np.array_split(X, n_splits)
    split_y = np.array_split(y, n_splits)
    return split_X, split_y

def compute_meta_features(X, y):
    # extract the meta-features for a single partition
    extractor = MFE(features=["t1"], groups=["complexity"],
                    summary=["min", "max", "mean", "sd"])
    extractor.fit(X, y)
    return extractor.extract()

def average_results(results):
    # average the per-partition summary values; the feature names are identical
    features = results[0][0]
    summary_values = np.mean([result[1] for result in results], axis=0)
    return features, summary_values

# Split the dataset
n_splits = 10  # ten splits
split_X, split_y = split_dataset(X, y, n_splits)

# Meta-features per partition
results = [compute_meta_features(X_part, y_part)
           for X_part, y_part in zip(split_X, split_y)]

# Combined results
final_features, final_summary = average_results(results)
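Note that np.array_split cuts by row order, so if the classes are not evenly spread across the rows, some partitions may miss classes entirely. A variation of the same idea using scikit-learn's StratifiedKFold keeps every class represented in each partition; this is only a sketch, and compute_partitioned is an illustrative name, not part of pymfe:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from pymfe.mfe import MFE

def compute_partitioned(X, y, n_splits=10, random_state=42):
    # each "test" fold of StratifiedKFold is one disjoint, class-balanced partition
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    results = []
    for _, idx in skf.split(X, y):
        extractor = MFE(features=["t1"], groups=["complexity"],
                        summary=["min", "max", "mean", "sd"])
        extractor.fit(X[idx], y[idx])
        results.append(extractor.extract())
    # average the per-partition summaries, as above
    names = results[0][0]
    values = np.mean([r[1] for r in results], axis=0)
    return names, values

final_features, final_summary = compute_partitioned(X, y, n_splits=10)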