This code is predicting sepal length from the iris dataset, and it is getting an MAE of around 0.94
from sklearn import metrics
from sklearn.neural_network import *
from sklearn.model_selection import *
from sklearn.preprocessing import *
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 1:]
y = iris.data[:, 0] # sepal length
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = MLPRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(metrics.mean_absolute_error(y_test, y_pred))
Though when I remove the scaling lines
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
the MAE goes down to 0.33. Am I scaling wrong, and why does the scaling make the error so much higher?
Interesting question. So let's first test a non-neural-net approach (plain LinearRegression instead of sklearn.neural_network.MLPRegressor) with and without scaling, putting in random states for reproducible results where appropriate:
from sklearn import metrics
from sklearn.neural_network import *
from sklearn.model_selection import *
from sklearn.preprocessing import *
from sklearn import datasets
import numpy as np
from sklearn.linear_model import LinearRegression
iris = datasets.load_iris()
X = iris.data[:, 1:]
y = iris.data[:, 0] # sepal length
### put random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1989)
lr = LinearRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
# Evaluating Model's Performance
print('Mean Absolute Error NO SCALE:', metrics.mean_absolute_error(y_test, pred))
print('Mean Squared Error NO SCALE:', metrics.mean_squared_error(y_test, pred))
print('Root Mean Squared Error NO SCALE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))
print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
### put random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1989)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
lr = LinearRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
# Evaluating Model's Performance
print('Mean Absolute Error YES SCALE:', metrics.mean_absolute_error(y_test, pred))
print('Mean Squared Error YES SCALE:', metrics.mean_squared_error(y_test, pred))
print('Root Mean Squared Error YES SCALE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))
Gives:
Mean Absolute Error NO SCALE: 0.2789437424421388
Mean Squared Error NO SCALE: 0.1191038134603132
Root Mean Squared Error NO SCALE: 0.3451142035041635
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Mean Absolute Error YES SCALE: 0.27894374244213865
Mean Squared Error YES SCALE: 0.11910381346031311
Root Mean Squared Error YES SCALE: 0.3451142035041634
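Why are the results identical down to the last decimals? Standardizing the inputs is just an affine change of variables, and ordinary least squares simply absorbs it into the coefficients, so the predictions don't change. A small self-contained check of my own (not part of the original experiment) that illustrates this:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X, y = iris.data[:, 1:], iris.data[:, 0]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1989)
scaler = StandardScaler().fit(X_tr)
lr_raw = LinearRegression().fit(X_tr, y_tr)                    # fit on raw features
lr_std = LinearRegression().fit(scaler.transform(X_tr), y_tr)  # fit on standardized features
# standardization only rescales the coefficients (w_raw = w_std / per-feature std) ...
print(np.allclose(lr_raw.coef_, lr_std.coef_ / scaler.scale_))                     # True
# ... so the predictions are exactly the same
print(np.allclose(lr_raw.predict(X_te), lr_std.predict(scaler.transform(X_te))))   # True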
Ok. It looks like you are doing everything right when it comes to scaling. But dealing with neural nets has many nuances, and on top of that, what works for one architecture may not work for another, so where possible, experimentation will show the best approach.
Running your code also produces the following warning:
_multilayer_perceptron.py:692: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (100) reached and the optimization hasn't converged yet. warnings.warn(
So your algorithm doesn't converge, and hence your MAE is high. It optimizes in steps, and 100 steps weren't enough, so the number of iterations must be increased in order to finish training and decrease the MAE.
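One way to see this directly (a sketch of my own, not part of your code; it reuses the X_train / y_train from your snippet) is to raise max_iter and then inspect the fitted model's n_iter_ and loss_curve_ attributes, which MLPRegressor exposes after fitting:
model = MLPRegressor(random_state=100, max_iter=2000)  # pick something generous
model.fit(X_train, y_train)
print(model.n_iter_)           # iterations actually used; if this is < max_iter, the optimizer stopped because it converged (tol reached)
print(model.loss_curve_[-5:])  # tail of the training loss curve; it should have flattened out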
Additionally, because of the way the error is propagated back to the weights during training, a big spread in the targets may result in large gradients, causing drastic changes in the weights and making training unstable or preventing convergence altogether.
Overall, NNs TEND to perform best when inputs are on a common scale, and TEND to train faster (the max_iter parameter here, see below). We will check that next...
On top of that, the type of transform may matter too: standardization vs normalization, and the variants within each. For example, for RNNs, scaling to the range -1 to 1 TENDS to perform better than 0 to 1.
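For completeness, the two families of transforms look like this in scikit-learn; which range works best is something you have to test on your own data (reusing the X_train / X_test splits from above):
from sklearn.preprocessing import StandardScaler, MinMaxScaler
standardize = StandardScaler()                        # zero mean, unit variance per feature
normalize_01 = MinMaxScaler(feature_range=(0, 1))     # squashes each feature into [0, 1]
normalize_pm1 = MinMaxScaler(feature_range=(-1, 1))   # squashes each feature into [-1, 1]
X_train_scaled = normalize_pm1.fit_transform(X_train) # fit on the training set only...
X_test_scaled = normalize_pm1.transform(X_test)       # ...then apply the same statistics to the test set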
Let's run the MLPRegressor experiments next:
### DO IMPORTS
from sklearn import metrics
from sklearn.neural_network import *
from sklearn.model_selection import *
from sklearn.preprocessing import *
from sklearn import datasets
import numpy as np
### GET DATASET
iris = datasets.load_iris()
X = iris.data[:, 1:]
y = iris.data[:, 0] # sepal length
#########################################################################################
# SCALE INPUTS = NO
# SCALE TARGETS = NO
#########################################################################################
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)
# put a random state here as well because of the way NNs get set up: there is randomization in the initial weights
# the max iterations for each case were found manually, but you can also grid-search them since max_iter is basically a hyperparameter (see the GridSearchCV sketch after the first set of results below)
model = MLPRegressor(random_state=100, max_iter=450)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('----------------------------------------------------------------------')
print("SCALE INPUTS = NO & SCALE TARGETS = NO")
print('----------------------------------------------------------------------')
print('Mean Absolute Error', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
----------------------------------------------------------------------
SCALE INPUTS = NO & SCALE TARGETS = NO
----------------------------------------------------------------------
Mean Absolute Error 0.25815648734192126
Mean Squared Error 0.10196864342576142
Root Mean Squared Error 0.319325294058835
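(As mentioned in the code comment above, instead of finding max_iter by hand you can grid-search it like any other hyperparameter. A rough sketch, with an illustrative parameter grid of my own choosing:)
from sklearn.model_selection import GridSearchCV
param_grid = {'max_iter': [200, 450, 900, 1500]}      # illustrative values only
search = GridSearchCV(MLPRegressor(random_state=100), param_grid,
                      scoring='neg_mean_absolute_error', cv=5)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)       # best max_iter and its cross-validated MAE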
#########################################################################################
# SCALE INPUTS = YES
# SCALE TARGETS = NO
#########################################################################################
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = MLPRegressor(random_state=100, max_iter=900)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('----------------------------------------------------------------------')
print("SCALE INPUTS = YES & SCALE TARGETS = NO")
print('----------------------------------------------------------------------')
print('Mean Absolute Error', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
----------------------------------------------------------------------
SCALE INPUTS = YES & SCALE TARGETS = NO
----------------------------------------------------------------------
Mean Absolute Error 0.2699225498998305
Mean Squared Error 0.1221046275841224
Root Mean Squared Error 0.3494347257845482
#########################################################################################
# SCALE INPUTS = NO
# SCALE TARGETS = YES
#########################################################################################
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)
scaler_y = StandardScaler()
y_train = scaler_y.fit_transform(y_train.reshape(-1, 1))
### NO NEED TO RESCALE y_test since the network never sees it
# y_test = scaler_y.transform(y_test.reshape(-1, 1))
model = MLPRegressor(random_state=100, max_iter=500)
model.fit(X_train, y_train.ravel())
y_pred = model.predict(X_test)
### rescale predictions back to y_test scale
y_pred_rescaled_back = scaler_y.inverse_transform(y_pred.reshape(-1, 1))
print('----------------------------------------------------------------------')
print("SCALE INPUTS = NO & SCALE TARGETS = YES")
print('----------------------------------------------------------------------')
print('Mean Absolute Error', metrics.mean_absolute_error(y_test, y_pred_rescaled_back))
print('Mean Squared Error', metrics.mean_squared_error(y_test, y_pred_rescaled_back))
print('Root Mean Squared Error', np.sqrt(metrics.mean_squared_error(y_test, y_pred_rescaled_back)))
----------------------------------------------------------------------
SCALE INPUTS = NO & SCALE TARGETS = YES
----------------------------------------------------------------------
Mean Absolute Error 0.23602139631237182
Mean Squared Error 0.08762790909543768
Root Mean Squared Error 0.29602011603172795
#########################################################################################
# SCALE INPUTS = YES
# SCALE TARGETS = YES
#########################################################################################
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)
scaler_x = StandardScaler()
scaler_y = StandardScaler()
X_train = scaler_x.fit_transform(X_train)
X_test = scaler_x.transform(X_test)
y_train = scaler_y.fit_transform(y_train.reshape(-1, 1))
### NO NEED TO RESCALE y_test since the network never sees it
# y_test = scaler_y.transform(y_test.reshape(-1, 1))
model = MLPRegressor(random_state=100, max_iter=250)
model.fit(X_train, y_train.ravel())
y_pred = model.predict(X_test)
### rescale predictions back to y_test scale
y_pred_rescaled_back = scaler_y.inverse_transform(y_pred.reshape(-1, 1))
print('----------------------------------------------------------------------')
print("SCALE INPUTS = YES & SCALE TARGETS = YES")
print('----------------------------------------------------------------------')
print('Mean Absolute Error', metrics.mean_absolute_error(y_test, y_pred_rescaled_back))
print('Mean Squared Error', metrics.mean_squared_error(y_test, y_pred_rescaled_back))
print('Root Mean Squared Error', np.sqrt(metrics.mean_squared_error(y_test, y_pred_rescaled_back)))
----------------------------------------------------------------------
SCALE INPUTS = YES & SCALE TARGETS = YES
----------------------------------------------------------------------
Mean Absolute Error 0.2423901612747137
Mean Squared Error 0.09758236232324796
Root Mean Squared Error 0.3123817573470768
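By the way, the manual fit_transform / inverse_transform bookkeeping for the targets in the last two experiments can be delegated to scikit-learn's TransformedTargetRegressor, and the input scaling to a pipeline. Roughly (same idea as the last experiment, not a re-run of the numbers above):
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
# assumes X_train / X_test here are the raw (unscaled) splits from train_test_split
# inputs are scaled inside the pipeline, targets are scaled / inverse-scaled inside the wrapper
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), MLPRegressor(random_state=100, max_iter=250)),
    transformer=StandardScaler())
model.fit(X_train, y_train)       # y is standardized internally before fitting
y_pred = model.predict(X_test)    # predictions come back already on the original scale
print('Mean Absolute Error', metrics.mean_absolute_error(y_test, y_pred))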
So it looks like, with this particular way of scaling, for this particular architecture and dataset, you converge the fastest with scaled inputs and scaled targets, but in the process you probably lose some information (with this particular transform) that is useful for prediction, and so your MAE is slightly higher than when, for example, you don't scale the inputs but do scale the targets.
Even here, however, I think changing the learning rate hyperparameter (within MLPRegressor) can, for example, help the model converge faster when values are not scaled, but you would need to experiment with that as well... As you can see: many nuances indeed.
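For reference, the learning rate in MLPRegressor is controlled by learning_rate_init (plus the learning_rate schedule); a quick way to probe it, with values chosen purely for illustration and reusing one of the train/test splits from above:
for lr_init in (0.0005, 0.001, 0.005, 0.01):  # 0.001 is the default; the others are illustrative
    model = MLPRegressor(random_state=100, max_iter=450, learning_rate_init=lr_init)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(lr_init, model.n_iter_, metrics.mean_absolute_error(y_test, y_pred))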
PS Some good discussions on this topic