python machine-learning scaling test-data normalization

How can I fit the test data using min max scaler when I am loading the model?


I am building an autoencoder model. Before fitting and saving the model, I scaled the data using MinMaxScaler:

X_train = df.values
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

After this I fitted the model and saved it as an 'h5' file. Now when I feed test data to the loaded model, that data naturally has to be scaled as well.

So when I load the model and scale the test data using

X_test_scaled  = scaler.transform(X_test)

It gives the error

NotFittedError: This MinMaxScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

So I used X_test_scaled = scaler.fit_transform(X_test) instead (which I had a hunch was foolish). It did give a result after loading the saved model and testing, but that result differed from the one I got when I trained and tested in the same run. I have saved around 4000 models for my purpose, so I can't train and save them all again because that would take a lot of time. I need a way out.

Is there a way to scale the test data with the same transformation I used for training (maybe by saving the scaling values; I do not know)? Or can I perhaps de-scale the model so that I can test it on unscaled data?

If I under-emphasized or over-emphasized any point, please let me know in the comments!


Solution

  • X_test_scaled  = scaler.fit_transform(X_test)
    

    scales X_test using the per-feature minimum and maximum values of X_test, not those of X_train.

    Your original code probably failed because you did not save scaler after fitting it to X_train, or because you overwrote it somehow (e.g., by re-initializing it in the test script). The 'h5' file holds only the model; the scaler is a separate Python object and has to be saved on its own. This is why the error was thrown: the scaler in your test script had not been fitted to any data.

    When you then call X_test_scaled = scaler.fit_transform(X_test), you fit scaler to X_test and transform X_test in the same step. That is why the code ran, but the step is incorrect, as you already surmised.

    What you want is

    import pickle as pkl
    
    from sklearn.preprocessing import MinMaxScaler
    
    X_train = df.values
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    
    # Save the fitted scaler alongside the model
    with open("scaler.pkl", "wb") as outfile:
        pkl.dump(scaler, outfile)
    
    # Some other code for training your autoencoder
    # ...
    

    Then in your test script

    # During test time
    import pickle as pkl
    
    # Load the scaler that was fitted on the training data
    with open("scaler.pkl", "rb") as infile:
        scaler = pkl.load(infile)
    
    X_test_scaled = scaler.transform(X_test)  # Note: not fit_transform.
    

    Note that you don't have to re-fit the scaler object after loading it back from disk. It contains all the information (the per-feature minima, maxima, and scaling factors) obtained from the training data. You just call scaler.transform on X_test.
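
For the models that are already saved without their scaler, one possible way out is to refit a fresh MinMaxScaler on the same training data each model was trained on. This is only a sketch and assumes that training data is still available. Fitting MinMaxScaler just records per-feature minima and maxima, so refitting on identical data should reproduce the same parameters without retraining any model. The random arrays and the names original, recovered, and restored below are hypothetical stand-ins for illustration.

```python
import pickle

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-ins for the real training and test arrays.
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 10.0, size=(100, 3))
X_test = rng.uniform(2.0, 8.0, size=(20, 3))

# The scaler used at training time (the one that was never saved).
original = MinMaxScaler().fit(X_train)

# Refit a fresh scaler on the same training data; no model retraining involved.
recovered = MinMaxScaler().fit(X_train)
same_params = (
    np.array_equal(original.data_min_, recovered.data_min_)
    and np.array_equal(original.data_max_, recovered.data_max_)
)

# Round-trip through pickle, as one would when saving to and loading from disk.
restored = pickle.loads(pickle.dumps(recovered))

# Correct: transform the test data with training-set statistics.
X_test_scaled = restored.transform(X_test)

# Incorrect: fit_transform rescales using the test set's own min and max.
X_test_wrong = MinMaxScaler().fit_transform(X_test)

print("refit reproduces the original parameters:", same_params)
print("transform equals fit_transform on test data:",
      np.allclose(X_test_scaled, X_test_wrong))
```

If the training data for a given model is no longer available, this approach does not apply, because the training minima and maxima cannot be recovered from the test data.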