python machine-learning pyspark feature-selection google-cloud-dataproc

Feature Selection in PySpark


I am working on a machine learning model with a data set of shape 1,456,354 × 53. I want to do feature selection on this data set. I know how to do feature selection in Python using the following code.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import numpy as np

# Recursively eliminate one feature at a time until 28 remain
logreg = LogisticRegression()
rfe = RFE(logreg, step=1, n_features_to_select=28)
rfe = rfe.fit(df.values, arrythmia.values)

# Boolean mask of the selected features, mapped back to column names
features_bool = np.array(rfe.support_)
features = np.array(df.columns)
result = features[features_bool]
print(result)

However, I could not find any article showing how to perform recursive feature selection in PySpark.

I tried to import the sklearn libraries in PySpark, but it gave me an error: sklearn module not found. I am running PySpark on a Google Dataproc cluster.

Could someone please help me achieve this in PySpark?


Solution

  • We can try the following feature selection methods in PySpark:

    • `ChiSqSelector` (or `UnivariateFeatureSelector` on Spark 3.1+) from `pyspark.ml.feature`, which keeps the features most associated with the label according to a statistical test
    • model-based selection, e.g. fitting a `RandomForestClassifier` and keeping the top-ranked features from its `featureImportances`
