python, pandas, scikit-learn

sklearn MinMaxScaler() with groupby pandas


I have two features, rank and ratings, for different product IDs in different categories, scraped from an e-commerce website on different dates.

A sample dataframe is available here:

import pandas as pd
import numpy as np
import warnings; warnings.simplefilter('ignore')
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

df=pd.read_csv('https://raw.githubusercontent.com/amanaroratc/hello-world/master/testdf.csv')
df.head()

      category                bid         date  rank    ratings
0   Aftershave  ASCDBNYZ4JMSH42B    2021-10-01  61.0    462.0
1   Aftershave  ASCDBNYZ4JMSH42B    2021-10-02  69.0    462.0
2   Aftershave  ASCDBNYZ4JMSH42B    2021-10-05  89.0    463.0
3   Aftershave  ASCE3DZK2TD7G4DN    2021-10-01  309.0   3.0
4   Aftershave  ASCE3DZK2TD7G4DN    2021-10-02  319.0   3.0

I want to normalize rank and ratings using MinMaxScaler() from sklearn.

I tried

cols=['rank','ratings']
features=df[cols]
scaler1=MinMaxScaler()
df[['rank_norm_mm', 'ratings_norm_mm']] = scaler1.fit_transform(features)

This normalizes over the entire dataset. I want to normalize within each category on each date instead, using groupby.


Solution

  • Use GroupBy.apply:

    file = 'https://raw.githubusercontent.com/amanaroratc/hello-world/master/testdf.csv'
    df=pd.read_csv(file)
    
    from sklearn.preprocessing import MinMaxScaler
    
    cols=['rank','ratings']
    
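    def f(x):
        # Fit a fresh scaler on each (category, date) group
        x = x.copy()  # work on a copy rather than mutating the group in place
        scaler1=MinMaxScaler()
        x[['rank_norm_mm', 'ratings_norm_mm']] = scaler1.fit_transform(x[cols])
        return x
    
    # group_keys=False keeps the original row index instead of prepending the group keys
    df = df.groupby(['category', 'date'], group_keys=False).apply(f)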
    

    Another solution:

    file = 'https://raw.githubusercontent.com/amanaroratc/hello-world/master/testdf.csv'
    df=pd.read_csv(file)
    
    from sklearn.preprocessing import MinMaxScaler
    
    scaler1=MinMaxScaler()
    cols=['rank','ratings']
    
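    # Keep the original index and column names so the join aligns row by row,
    # then append the suffix to get rank_norm_mm and ratings_norm_mm
    df = df.join(df.groupby(['category', 'date'], group_keys=False)[cols]
                   .apply(lambda x: pd.DataFrame(scaler1.fit_transform(x),
                                                 index=x.index, columns=cols))
                   .add_suffix('_norm_mm'))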
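
A third option, not part of the answer above, is to skip sklearn and compute the min-max formula (x - min) / (max - min) per group with GroupBy.transform. The sketch below is only an illustration under stated assumptions: `toy` is a small made-up frame that reuses the question's column names, and constant groups are mapped to 0, which is what MinMaxScaler with its default feature_range=(0, 1) does when a feature has zero range.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the question's dataframe
toy = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': ['2021-10-01'] * 3 + ['2021-10-02'] * 3,
    'rank': [10.0, 20.0, 40.0, 5.0, 5.0, 5.0],
    'ratings': [1.0, 3.0, 2.0, 100.0, 300.0, 200.0],
})
cols = ['rank', 'ratings']

# Per-group minimum and maximum, broadcast back to the original row index
grouped = toy.groupby(['category', 'date'])[cols]
lo = grouped.transform('min')
hi = grouped.transform('max')

# A constant group has zero range; dividing by 1 instead gives 0 rather than NaN
span = (hi - lo).replace(0, 1)

scaled = (toy[cols] - lo) / span
toy[['rank_norm_mm', 'ratings_norm_mm']] = scaled.to_numpy()
print(toy)
```

This avoids calling a Python function once per group, which may matter when there are many (category, date) combinations. If StandardScaler or RobustScaler is needed later, the apply-based versions above are easier to adapt.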