scikit-learn, normalization, train-test-split

Scaling row-wise with MinMaxScaler from Sklearn


By default, scalers from Sklearn work column-wise. But I need my data to be scaled row-wise, so I did the following:

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np

# %% Generating sample data

x = np.array([[-1, 4, 2], [-0.5, 8, 9], [3, 2, 3]])
y = np.array([1, 2, 3])

#%% Train/Test split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2)

scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train.T).T # scaling line-wise
x_test = scaler.transform(x_test)  # <-- error raised here

But I am getting the following error:

ValueError: X has 3 features, but MinMaxScaler is expecting 2 features as input.

I don't understand what's wrong here. Why does it say it is expecting 2 features, when all my X arrays (x, x_train and x_test) have 3 features? How can I fix this?


Solution

  • MinMaxScaler is stateful: when you fit it, it calculates and saves each column's minimum and maximum; when transforming (train or test sets), it uses those saved statistics. Your transpose trick doesn't work with that: after transposing, the columns are your training rows, so the scaler saves one set of statistics per training row and expects as many features as there were training rows (2 here, hence the error message). Your test set doesn't have the same rows, so transform cannot work correctly: it throws an error if the number of rows differs, or silently mis-scales if the number happens to match.

    What you want isn't stateful: the test set should be transformed completely independently of the training set. Indeed, every row should be transformed independently of every other row. So you could apply this transformation before splitting, or call fit_transform separately on the test set's transpose (and transpose the result back).

    For l2 normalization of rows, there's a builtin: Normalizer. I don't think there's an analogue for min-max normalization, but you could write a FunctionTransformer to do it.
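
A minimal sketch of the FunctionTransformer idea, reusing the sample data from the question. The helper name row_minmax and the choice to map constant rows to all zeros are assumptions made for illustration, not part of the original answer:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer


def row_minmax(X):
    """Scale each row of X to [0, 1] using that row's own min and max."""
    X = np.asarray(X, dtype=float)
    row_min = X.min(axis=1, keepdims=True)
    row_range = X.max(axis=1, keepdims=True) - row_min
    # Assumption: a constant row (range 0) is mapped to all zeros
    # rather than dividing by zero.
    row_range[row_range == 0] = 1.0
    return (X - row_min) / row_range


x = np.array([[-1, 4, 2], [-0.5, 8, 9], [3, 2, 3]])
y = np.array([1, 2, 3])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=2
)

# Stateless transformer: fit() learns nothing from the data, so the
# train and test sets are each scaled using only their own rows.
row_scaler = FunctionTransformer(row_minmax)
x_train_scaled = row_scaler.fit_transform(x_train)
x_test_scaled = row_scaler.transform(x_test)
```

Because nothing is learned during fit, it makes no difference whether this is applied before or after the split. As far as I know, sklearn.preprocessing.minmax_scale also accepts an axis argument, so minmax_scale(X, axis=1) could probably be wrapped in the FunctionTransformer instead of the hand-written helper.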