python, pandas, sklearn-pandas, dtype

MinMaxScaler doesn't scale small values to 1


I found weird behavior in sklearn.preprocessing.MinMaxScaler (and the same in sklearn.preprocessing.RobustScaler): when the data's maximum value is very small, on the order of 10^(-16), the transformer leaves the data's maximum unchanged from the raw maximum. Why? df_small.dtypes is float64, and that type can represent much smaller numbers. How can I fix this without the handcrafted data = data / data.max()?

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_small = pd.DataFrame(np.arange(5) * 10.0**(-16))
scaler_small = MinMaxScaler()
small_transformed = scaler_small.fit_transform(df_small)
print(small_transformed)

[[0.e+00]
 [1.e-16]
 [2.e-16]
 [3.e-16]
 [4.e-16]]

df_not_small = pd.DataFrame(np.arange(5) * 10.0**(-15))
scaler_not_small = MinMaxScaler()
not_small_transformed = scaler_not_small.fit_transform(df_not_small)
print(not_small_transformed)

[[0.  ]
 [0.25]
 [0.5 ]
 [0.75]
 [1.  ]]

Solution

  • When applying the scaling, MinMaxScaler calls the _handle_zeros_in_scale() function, which performs this check:

    constant_mask = scale < 10 * np.finfo(scale.dtype).eps
    

    For a dtype of np.float64, the value of 10 * np.finfo(scale.dtype).eps is 2.220446049250313e-15. That is larger than your data's range of 4e-16 in the first case, but smaller than the range of 4e-15 in the second case. Whenever the scale falls below this threshold, the feature is treated as constant and the scale factor is set to 1 (see this line):

    scale[constant_mask] = 1.0
    

    Unfortunately, you'll either have to scale the data yourself or patch scikit-learn to accept features with smaller overall ranges. A quick numeric check of the threshold and a couple of workaround sketches follow below.
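
To confirm the cutoff numerically, here is a minimal check (assuming float64 data, as in the examples above):

    import numpy as np

    threshold = 10 * np.finfo(np.float64).eps
    print(threshold)          # 2.220446049250313e-15

    # Ranges (max - min) of the two examples above:
    print(4e-16 < threshold)  # True  -> treated as constant, scale reset to 1.0
    print(4e-15 < threshold)  # False -> normal min-max scaling applies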
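
As for workarounds, here is a minimal sketch, assuming data like df_small above: either pre-multiply by an arbitrary constant (the min-max result is invariant to a positive constant factor once the range clears the threshold; the factor 1e6 below is purely illustrative), or compute the transform by hand:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df_small = pd.DataFrame(np.arange(5) * 10.0**(-16))

    # Option 1: pre-scale by a constant so the range exceeds 10 * eps;
    # the min-max result is unchanged by a positive constant factor.
    print(MinMaxScaler().fit_transform(df_small * 1e6))

    # Option 2: compute the min-max transform manually.
    print((df_small - df_small.min()) / (df_small.max() - df_small.min()))

Both print values running from 0 to 1 in steps of 0.25, as in your second example.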