Orange: How to make sure the same PCA is applied to both the train and test datasets?


In Orange, I can attach a dataset to a PCA for dimensionality reduction.

Typically, in code, I would apply the trained PCA to test data after fitting it to the training data.

In Orange, it appears as if the PCA can only be placed downstream from either the train or the test set.

Is there a way to run the PCA transform trained on the training data on the test data?


Solution

  • If you train the model on PCA-transformed data, you can use it directly on data with the original variables; the model will apply the same PCA transformation automatically.

    If you split the data using Data Sampler, you can do the following.

    [image: Orange workflow screenshot]

    I used Select Columns to remove the original variables.

    When the Logistic Regression model gets the data to classify (the test data, i.e. "Remaining Data" from Data Sampler), it will project it to PCA coordinates.

    If you want to use Test and Score, do this:

    [image: Orange workflow screenshot]

    In Test and Score, don't forget to select "Test on test data". This case is slightly different: Test and Score itself sees that the test data has different variables than the training data and transforms it.

    For cross-validation, you need the following.

    [image: Orange workflow screenshot]

    In Preprocess, add Principal Component Analysis. In Test and Score, use cross-validation (or another sampling method). In this workflow, the Preprocess widget provides a list of preprocessors. In each cross-validation iteration, these are computed on (= fit on) the training data; Test and Score then applies this projection (or other preprocessing) to the test data, fits the model(s) and tests them.
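
For readers who think in code, the "fit on training data, apply to test data" logic behind these workflows can be sketched roughly as follows. This is a plain NumPy illustration under my own assumptions, not Orange's actual implementation: `fit_pca`, `apply_pca` and `project_per_fold` are hypothetical helper names, and the random data is there only to make the example runnable.

```python
import numpy as np


def fit_pca(X_train, n_components):
    """Learn the centering vector and principal axes from training data only."""
    mean = X_train.mean(axis=0)
    # SVD of the centered training matrix; the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(X, mean, components):
    """Project any data (train or test) using already-fitted parameters."""
    return (X - mean) @ components.T


def project_per_fold(X, n_folds, n_components):
    """Hypothetical helper: refit PCA on the training part of every fold."""
    indices = np.arange(len(X))
    projections = []
    for held_out in np.array_split(indices, n_folds):
        train_idx = np.setdiff1d(indices, held_out)
        mean, comps = fit_pca(X[train_idx], n_components)
        # The held-out rows never influence the fitted mean or axes.
        projections.append((apply_pca(X[train_idx], mean, comps),
                            apply_pca(X[held_out], mean, comps)))
    return projections


# Illustrative random data (an assumption, not from the question).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_train, X_test = X[:70], X[70:]

# Single train/test split: fit once on train, reuse the projection on test.
mean, components = fit_pca(X_train, n_components=2)
Z_train = apply_pca(X_train, mean, components)
Z_test = apply_pca(X_test, mean, components)  # no refit on the test data

# Cross-validation: one PCA per fold, each fitted on that fold's training rows.
folds_out = project_per_fold(X, n_folds=5, n_components=2)
```

The point of the sketch is only that the test data is centered and projected with parameters learned from the training data, which is what the workflows above arrange for you.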