
How to apply model trained with PCA and Random Forest to test data?


While solving a machine learning problem, I fit PCA on the training data and then apply .transform to the training data using sklearn. After inspecting the variances, I retain only those columns of the transformed data whose variance is large. I then train a model on them using RandomForestClassifier. Now I am confused: how do I apply the trained model to the test data, given that the test data has a different number of columns from the retained transformed data on which the random forest was trained?


Solution

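  • Here is one way to do it, if this is what you are after. You should apply the same PCA, with the same number of principal components, to the test data as to the training data: fit it on the training data only, then call transform on the test data. Fitting a separate PCA on the test data would defeat the purpose of a hold-out set.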

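    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    
    # Fit PCA on the training data only and keep the first 20 components.
    pca = PCA(n_components=20)
    train_features = pca.fit_transform(train_data)
    
    rfr = RandomForestClassifier(n_estimators=100, n_jobs=1,
                                 random_state=2016, verbose=1,
                                 class_weight='balanced', oob_score=True)
    
    # train_labels is assumed to hold the targets for train_data.
    rfr.fit(train_features, train_labels)
    
    # Reuse the already-fitted PCA on the test data; do not refit it.
    test_features = pca.transform(test_data)
    predictions = rfr.predict(test_features)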
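If you would rather not track the retained columns by hand, the same idea can be sketched with a scikit-learn Pipeline. This is only an illustration under a few assumptions: the data is synthetic (make_classification stands in for your own arrays), the 0.95 explained-variance threshold is an arbitrary choice standing in for "columns whose variance is large", and a reasonably recent scikit-learn is installed.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (hypothetical; replace with your own arrays).
X, y = make_classification(n_samples=300, n_features=40, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = Pipeline([
    # Keep as many leading components as are needed to explain 95% of the
    # variance (the 0.95 threshold is an assumption, not from the question).
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=2016)),
])

# fit() fits PCA on X_train only, then trains the forest on the reduced data.
model.fit(X_train, y_train)

# predict() applies the same fitted PCA to X_test before calling the forest,
# so the number of columns always matches what the forest was trained on.
predictions = model.predict(X_test)

# Number of components PCA actually retained for the chosen threshold.
n_kept = model.named_steps["pca"].n_components_
```

Because the PCA step lives inside the pipeline, the projection learned from the training data is reused automatically at prediction time, which is what keeps the test columns consistent with the training columns.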