python, scikit-learn, scale, normalize

Normalization vs. scaling for non-normally distributed data in scikit-learn


For my university project, I was given data whose features have very different ranges of values and are not normally distributed. I have read the sklearn documentation on normalization, which says that normalization is the process of scaling individual samples to have unit norm. sklearn also has Normalization and StandardScaler, which seem to have the same function: scaling the data. But then I read this article on the differences between scaling and normalization, which says that normalization is how you reach a normal distribution and scaling is how you change the range of your data.

  1. Why does the sklearn Normalization function do the same job as StandardScaler if the two are supposed to do different things to the data?
  2. Does this mean that scaling can produce a normal distribution, and that scaling is actually one of the ways to normalize a distribution?
  3. In my case, with widely varying ranges of values and a non-normal distribution, if normalizing and scaling are different things, does that mean I have to both scale and normalize my data?

Solution

  • Normalization has different meanings depending on the context, and the term is sometimes misleading. I think sklearn uses the terms interchangeably to mean adjusting values measured on different scales to a notionally common scale (e.g., between 0 and 1), rather than changing the data so that they follow a Normal distribution. This holds for StandardScaler too: it shifts and rescales each feature to zero mean and unit variance, which is a linear transformation and therefore leaves the shape of the distribution unchanged.
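
    One way to check whether rescaling changes the shape of a distribution is to compare the skewness before and after. The sketch below is only an illustration: the exponential sample is made-up data, and MinMaxScaler and scipy.stats.skew are my own choices, not something taken from the question.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical right-skewed feature (exponential draws), for illustration only
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(1000, 1))

x_std = StandardScaler().fit_transform(x)     # zero mean, unit variance
x_minmax = MinMaxScaler().fit_transform(x)    # rescaled to the range [0, 1]

# Location and scale change ...
print("standardized mean/std:", x_std.mean(), x_std.std())
print("min-max range:", x_minmax.min(), x_minmax.max())

# ... but skewness, which describes the shape, is the same before and after
skew_raw = skew(x.ravel())
skew_std = skew(x_std.ravel())
skew_minmax = skew(x_minmax.ravel())
print("skewness:", skew_raw, skew_std, skew_minmax)
```

    If the three skewness values agree, the data are as non-normal after scaling as they were before; only their range or their mean and variance have changed.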

    From my understanding, the sklearn tools differ in the input they work on, in how they transform it, and in where they can be used.

    I assume that by Normalization you mean sklearn.preprocessing.Normalizer.

    So, the main difference is that sklearn.preprocessing.Normalizer scales samples to unit norm (vector length), while sklearn.preprocessing.StandardScaler scales features to unit variance after subtracting the mean. Therefore, the former works on the rows, while the latter works on the columns.

    In particular,

    1. sklearn.preprocessing.normalize "scales input vectors individually to unit norm (vector length)". It can be applied either to rows/samples (by setting the parameter axis to 1, the default) or to features/columns (by setting axis to 0). It uses one of the following norms: l1, l2, or max to normalize each non-zero sample (or each non-zero feature if axis is 0). Note: The term norm here refers to the mathematical definition. See here and here for more information.

    2. sklearn.preprocessing.Normalizer "normalizes samples individually to unit norm". It behaves exactly like sklearn.preprocessing.normalize with axis=1. Unlike normalize, Normalizer implements the Transformer API, so it can be used, for example, as a step in a preprocessing sklearn.pipeline.Pipeline.

    3. sklearn.preprocessing.StandardScaler "standardizes features by removing the mean and scaling to unit variance". It does not use the norm of a vector; instead, it computes the z-score of each feature, z = (x - mean) / std.
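
    To make the row/column distinction concrete, the two operations can be reproduced by hand with NumPy and compared against sklearn. This is a minimal sketch; the variable names are mine, and l2 is used here because it is the default norm of normalize.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

X = np.array([[1.0, 2.0],
              [2.0, 4.0]])

# L2 normalization of rows: divide each row by its Euclidean length
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_rows_manual = X / row_norms
X_rows_sklearn = normalize(X, norm='l2', axis=1)

# Standardization of columns: subtract the column mean, divide by the column std
# (StandardScaler uses the population standard deviation, i.e. ddof=0)
X_cols_manual = (X - X.mean(axis=0)) / X.std(axis=0)
X_cols_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_rows_manual, X_rows_sklearn))
print(np.allclose(X_cols_manual, X_cols_sklearn))

# Each normalized row has length 1; each standardized column has mean 0 and std 1
print(np.linalg.norm(X_rows_sklearn, axis=1))
print(X_cols_sklearn.mean(axis=0), X_cols_sklearn.std(axis=0))
```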

    This interesting article explores the differences among them in more detail.

    Let's use norm='max' for convenience:

    from sklearn.preprocessing import normalize, Normalizer, StandardScaler
    
    X = [[1, 2],
         [2, 4]]
    
    # Normalize each column by the maximum of that column (x / max(column))
    normalize(X, norm='max', axis=0)
    
    # Normalize each row by the maximum of that row (x / max(row))
    normalize(X, norm='max', axis=1)
    
    # Normalize with Normalizer (only rows)
    Normalizer(norm='max').fit_transform(X)
    
    # Standardize with StandardScaler (only columns)
    StandardScaler().fit_transform(X)
    
    
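    from sklearn.pipeline import Pipeline
    
    # NOT POSSIBLE: normalize is a plain function, not a transformer with fit/transform
    # pipe = Pipeline([('normalization_step', normalize)])
    
    pipe = Pipeline([('normalization_step', Normalizer())])  # POSSIBLE
    
    pipe = Pipeline([('normalization_step', StandardScaler())])  # POSSIBLE
    
    # score(X, y) requires a final estimator step; with only a transformer, transform the data instead
    pipe.fit_transform(X)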
    
    

    The aforementioned lines of code would transform the data as follows:

    
    # Normalize with normalize, axis=0 (columns)
    [[0.5, 0.5],
     [1. , 1. ]]
    
    # Normalize with normalize, axis=1 (rows)
    [[0.5, 1. ],
     [0.5, 1. ]]
    
    # Normalize with Normalizer (rows)
    [[0.5, 1. ],
     [0.5, 1. ]]
    
    # Standardize with StandardScaler (columns)
    [[-1., -1.],
     [ 1.,  1.]]
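
    Scoring a pipeline needs a final estimator after the preprocessing step. Here is a minimal sketch of that, assuming a made-up toy dataset and LogisticRegression as the estimator (both are my own choices for illustration, not part of the question):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales, binary labels
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 100.0], [4.0, 300.0],
              [10.0, 250.0], [11.0, 150.0], [12.0, 350.0], [13.0, 250.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

pipe = Pipeline([
    ('scaling_step', StandardScaler()),     # transformer: works on the columns
    ('classifier', LogisticRegression()),   # final estimator: provides score()
])

pipe.fit(X, y)                # fits the scaler, then the classifier on scaled data
accuracy = pipe.score(X, y)   # mean accuracy on the data passed in
print(accuracy)
```

    Swapping StandardScaler() for Normalizer() in the first step works the same way, because both implement the Transformer API.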