Tags: python, pandas, label-encoding

Label encoding by value counts


I am trying to label-encode the cities in my dataset, but I want the labels assigned by frequency, so that the most common city gets the lowest label. For example, suppose the dataset has:

  • Oslo: 500 rows
  • Berlin: 400 rows
  • Napoli: 300 rows

The encoding should then follow the value counts: Oslo should be labeled 0, Berlin 1, and Napoli 2.

How can I do that?


Solution

  • Use Series.map with a helper Series whose index comes from Series.value_counts (sorted by count in descending order by default) and whose values are the labels 0, 1, 2, ...:

    import pandas as pd

    df = pd.DataFrame({'col': ['Berlin'] * 4 + ['Oslo'] * 5 + ['Napoli'] * 3})
    print(df)
    
    # Count rows per city; the result is sorted from most to least frequent
    s = df['col'].value_counts()
    print(s)
    Oslo      5
    Berlin    4
    Napoli    3
    Name: col, dtype: int64
    
    # Pair each city (in count order) with its position: 0, 1, 2, ...
    s1 = pd.Series(range(len(s)), index=s.index)
    print(s1)
    Oslo      0
    Berlin    1
    Napoli    2
    dtype: int64
           
    # Look up each row's city in s1 to get its label
    df['newcol'] = df['col'].map(s1)
    print(df)
           col  newcol
    0   Berlin       1
    1   Berlin       1
    2   Berlin       1
    3   Berlin       1
    4     Oslo       0
    5     Oslo       0
    6     Oslo       0
    7     Oslo       0
    8     Oslo       0
    9   Napoli       2
    10  Napoli       2
    11  Napoli       2
    

    Alternatively, build a dictionary with enumerate over the sorted index and pass it to map:

    s = df['col'].value_counts()
    # Map each city to its position in the count-sorted index
    d = {city: code for code, city in enumerate(s.index)}
    print(d)
    {'Oslo': 0, 'Berlin': 1, 'Napoli': 2}
    
    df['newcol'] = df['col'].map(d)
    print(df)
           col  newcol
    0   Berlin       1
    1   Berlin       1
    2   Berlin       1
    3   Berlin       1
    4     Oslo       0
    5     Oslo       0
    6     Oslo       0
    7     Oslo       0
    8     Oslo       0
    9   Napoli       2
    10  Napoli       2
    11  Napoli       2
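
If you need this in more than one place, the steps above can be wrapped in a small helper. The following is only a sketch: the function name encode_by_frequency and the column name newcol_cat are made up for illustration and are not part of pandas. It also shows a second route that should give the same labels here, using pd.Categorical with the categories ordered by count.

```python
import pandas as pd


def encode_by_frequency(values: pd.Series) -> pd.Series:
    """Hypothetical helper: label each value by frequency rank (0 = most frequent)."""
    # value_counts() sorts by count in descending order by default
    order = values.value_counts().index
    # Position in the sorted index becomes the label
    codes = {value: code for code, value in enumerate(order)}
    return values.map(codes)


df = pd.DataFrame({'col': ['Berlin'] * 4 + ['Oslo'] * 5 + ['Napoli'] * 3})
df['newcol'] = encode_by_frequency(df['col'])

# Alternative: a categorical whose categories follow the count order;
# .codes then holds each row's position in that category list
order = df['col'].value_counts().index
df['newcol_cat'] = pd.Categorical(df['col'], categories=order).codes

print(df)
```

Two things to keep in mind: value_counts drops missing values by default, so with map those rows end up as NaN, while the categorical codes mark them as -1. Also, when two cities have the same count, which one gets the lower label depends on how value_counts orders the tie, so do not rely on a particular order in that case.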