python-3.x, scikit-learn, normalization, iqr, standardization

How is .scale_ calculated by sklearn in Python? (What exactly is its algorithm?)


Suppose we have an array like this:

import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                     [ 2.,  0.,  0.],
                     [ 0.,  1., -1.]])

We fit sklearn's StandardScaler on it and read the .scale_ attribute with this code:

from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
scaler.scale_

and this result is shown:

array([0.81649658, 0.81649658, 1.24721913])

Do you know how it was calculated? If so, please give the formula. I assumed that .scale_ is the interquartile range (IQR), but when I calculate the IQR manually I get:

array([2, 2, 3]) rather than `array([0.81649658, 0.81649658, 1.24721913])`.

I also think array([0.81649658, 0.81649658, 1.24721913]) might be a normalized form of array([2, 2, 3]), but I don't know how it was normalized. Please help me work it out.


Solution

  • The three main statistics (mean, variance, and standard deviation) are available as attributes of the fitted scaler:

    mean = preprocessing.StandardScaler().fit(X_train).mean_ 
    variance = preprocessing.StandardScaler().fit(X_train).var_
    Standard_deviation = preprocessing.StandardScaler().fit(X_train).scale_
    

    For the array in the question:

    X_train = np.array([[ 1., -1.,  2.],
                         [ 2.,  0.,  0.],
                         [ 0.,  1., -1.]])
    

    mean = preprocessing.StandardScaler().fit(X_train).mean_
    print(mean)
    # [1.         0.         0.33333333]
    

    variance = preprocessing.StandardScaler().fit(X_train).var_
    print(variance)
    # [0.66666667 0.66666667 1.55555556]
    

    Standard_deviation = preprocessing.StandardScaler().fit(X_train).scale_
    print(Standard_deviation)
    # [0.81649658 0.81649658 1.24721913]
    

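    In other words, .scale_ is not the IQR. It is the square root of .var_, i.e. the per-column population standard deviation (the sum of squared deviations is divided by n, not n - 1):

    scaler.scale_ == np.sqrt(scaler.var_)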
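
To check this by hand, here is a minimal sketch that reproduces the same numbers with plain NumPy, without sklearn. The variable names (`n`, `std`, `sample_std`, `value_range`, `iqr`) are only illustrative. It also computes the per-column range and the IQR with NumPy's default percentile interpolation, which suggests that the [2, 2, 3] in the question is the range (max minus min) rather than the IQR.

```python
import numpy as np

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

n = X_train.shape[0]                      # number of samples (rows)

# Per-column mean, variance and standard deviation, computed explicitly.
mean = X_train.sum(axis=0) / n
var = ((X_train - mean) ** 2).sum(axis=0) / n   # divide by n, not n - 1
std = np.sqrt(var)

print(mean)  # column means
print(var)   # column variances
print(std)   # column standard deviations

# np.std uses ddof=0 by default, so it gives the same population estimate.
assert np.allclose(std, np.std(X_train, axis=0))

# The sample standard deviation (ddof=1) is a different quantity.
sample_std = np.std(X_train, axis=0, ddof=1)

# For comparison: per-column range and interquartile range.
value_range = X_train.max(axis=0) - X_train.min(axis=0)
q25, q75 = np.percentile(X_train, [25, 75], axis=0)
iqr = q75 - q25

print(value_range)
print(iqr)
```

If scaling by the IQR is what you actually want, sklearn provides `preprocessing.RobustScaler` for that; `StandardScaler` always divides by the standard deviation.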