Tags: python, pandas, series, categorical-data, binning

Pandas: convert categories to numbers


Suppose I have a dataframe of countries that looks like this:

cc | temp
US | 37.0
CA | 12.0
US | 35.0
AU | 20.0

I know that the pd.get_dummies function converts the countries to 'one-hot encodings'. However, I want to convert them to indices instead, so that I get cc_index = [1,2,1,3].

I'm assuming that there is a faster way than using get_dummies together with numpy's where, as shown below:

[np.where(x) for x in pd.get_dummies(df.cc).values]

This is somewhat easier to do in R using 'factors' so I'm hoping pandas has something similar.


Solution

  • First, change the type of the column:

    df.cc = pd.Categorical(df.cc)
    

    Now the data look similar but are stored categorically. To capture the category codes:

    df['code'] = df.cc.cat.codes
    

    Now you have (the codes are zero-based and follow the sorted category order AU, CA, US):

       cc  temp  code
    0  US  37.0     2
    1  CA  12.0     1
    2  US  35.0     2
    3  AU  20.0     0
    

    If you don't want to modify your DataFrame but simply get the codes:

    df.cc.astype('category').cat.codes
    

    Or use the categorical column as an index:

    df2 = pd.DataFrame(df.temp)
    df2.index = pd.CategoricalIndex(df.cc)
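
Note that category codes are zero-based and numbered by sorted category, so they do not match the cc_index = [1,2,1,3] from the question, which numbers countries by first appearance starting at 1. A minimal sketch of one way to get that numbering, assuming pd.factorize is acceptable here (this is a swapped-in technique, not part of the answer above; the column name cc_index is taken from the question):

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

# Category codes: 0-based, numbered by sorted category (AU, CA, US).
cat_codes = df["cc"].astype("category").cat.codes

# factorize: 0-based codes numbered by first appearance (US, CA, AU),
# plus the unique values so a code can be mapped back to its label.
codes, uniques = pd.factorize(df["cc"])

# Shift by one to get the 1-based indices asked for in the question.
df["cc_index"] = codes + 1

print(cat_codes.tolist())       # [2, 1, 2, 0]
print(df["cc_index"].tolist())  # [1, 2, 1, 3]
print(list(uniques))            # ['US', 'CA', 'AU']
```

Either numbering works as an integer label. The difference only matters if the specific numbers are meaningful downstream, for example when the same mapping has to be reused on another dataframe.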