python pandas categorization

How to efficiently categorize a column in pandas (without pd.cut)?


I have a large DataFrame (millions of rows) and I want to categorize a numerical column into ordered labels. Only a few specific values should get a name; all other values should go into the "transient" category. Using pd.cut looks cumbersome and is tricky in some cases.
Is there an efficient way to get the same result as the example code below?

EDIT: Note that the final column is an ordered CategoricalDtype, not simply a column of strings. So the question is: is there a way to get this result without first mapping to strings and then converting to categorical?

import pandas as pd

MODES = [0, 9, 15, 25, 40]
CATS = 'B BC/2 BC AB ABC'.split()

# Build categorical type
my_cats = pd.CategoricalDtype(categories=['transient', 'B', 'BC/2', 'BC', 'AB', 'ABC'], ordered=True)

# The data to be categorized
df = pd.DataFrame(data=[-3, 1.99, 0, 3, 9, 12, 15, 17, 24.9999999, 25, 25.000000001, 34, 40-1e-13, 40], columns=['val'])

# Building...
df['cat'] = df['val'].replace(dict(zip(MODES, CATS)))
c_ = ~df['cat'].isin(CATS)
df.loc[c_, 'cat'] = 'transient'
df['cat'] = df['cat'].astype(my_cats)

# What I want
print(df)


Solution

  • Since you want an ordered categorical, work with CategoricalDtype from the start: use the MODES as the initial categories, then rename_categories to the CATS, and finally fillna so that every value that matched no mode ends up in the fill category:

    fill_value = 'transient'
    mode_dtype = pd.CategoricalDtype(categories=[fill_value]+MODES, ordered=True)
    df['cat'] = (df['val'].astype(mode_dtype).cat.rename_categories([fill_value]+CATS)
                          .fillna(fill_value)
                )
    

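To make the chain above easier to follow, here is a minimal sketch that runs the same three calls one at a time on a small sample. The sample values and the intermediate names (`as_modes`, `as_labels`, `result`) are only illustrative, and the sketch assumes a pandas version in which casting values that are absent from the categories yields NaN, which is what the snippet above relies on.

```python
import pandas as pd

MODES = [0, 9, 15, 25, 40]
CATS = ['B', 'BC/2', 'BC', 'AB', 'ABC']
fill_value = 'transient'

# Small sample: three exact modes and three near-misses
s = pd.Series([-3, 0, 9, 24.9999999, 25, 40 - 1e-13], name='val')

# Step 1: cast to a categorical keyed on the raw mode values.
# Values that are not exact modes become NaN at this point.
mode_dtype = pd.CategoricalDtype(categories=[fill_value] + MODES, ordered=True)
as_modes = s.astype(mode_dtype)

# Step 2: relabel the categories positionally (0 -> 'B', 9 -> 'BC/2', ...).
# Only the category labels change; the rows keep pointing at the same slots.
as_labels = as_modes.cat.rename_categories([fill_value] + CATS)

# Step 3: put every remaining NaN into the placeholder category.
result = as_labels.fillna(fill_value)

print(as_modes.isna().tolist())
print(result.tolist())
```

Matching is by exact equality, so 24.9999999 and 40 - 1e-13 are treated as non-modes, just as in the question's expected output.
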
    To generalize this to arbitrary modes and labels, you can wrap it in a function:

    def cat_converter(series, modes, cats, fill_value='transient'):
        mode_dtype = pd.CategoricalDtype(categories=[fill_value]+modes, ordered=True)
        return (series.astype(mode_dtype)
                      .cat.rename_categories([fill_value]+cats)
                      .fillna(fill_value)
                )
    
    df['cat'] = df['val'].pipe(cat_converter, MODES, CATS)
    # or
    # df['cat'] = cat_converter(df['val'], MODES, CATS)
    

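As a hedged illustration of the "arbitrary categories" point, the sketch below reuses the same function on made-up data. The status codes, the labels, and the `'other'` fill value are hypothetical and not from the question; the function body is repeated only so the example runs on its own.

```python
import pandas as pd

def cat_converter(series, modes, cats, fill_value='transient'):
    # Same function as above, repeated so the example is self-contained
    mode_dtype = pd.CategoricalDtype(categories=[fill_value] + modes, ordered=True)
    return (series.astype(mode_dtype)
                  .cat.rename_categories([fill_value] + cats)
                  .fillna(fill_value))

# Hypothetical data: integer codes, three of which get a name
codes = pd.Series([200, 404, 418, 500, 200])
labels = cat_converter(codes,
                       modes=[200, 404, 500],
                       cats=['ok', 'not_found', 'server_error'],
                       fill_value='other')

print(labels.tolist())
print(labels.cat.categories.tolist())
```

Note that `modes` and `cats` must be lists here, because the function concatenates them with `[fill_value]`, and that the fill category always comes first in the resulting order.
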
    Output:

          val        cat
    0   -3.00  transient
    1    1.99  transient
    2    0.00          B
    3    3.00  transient
    4    9.00       BC/2
    5   12.00  transient
    6   15.00         BC
    7   17.00  transient
    8   25.00  transient
    9   25.00         AB
    10  25.00  transient
    11  34.00  transient
    12  40.00  transient
    13  40.00        ABC
    

    The result is already an ordered Categorical, with no extra conversion step:

    df['cat'].dtype
    # CategoricalDtype(categories=['transient', 'B', 'BC/2', 'BC', 'AB', 'ABC'], ordered=True, categories_dtype=object)
    

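A short sketch of what the ordered dtype provides in practice: comparisons, sorting, and min/max follow the declared category order rather than alphabetical order. The sample labels are made up for illustration.

```python
import pandas as pd

my_cats = pd.CategoricalDtype(
    categories=['transient', 'B', 'BC/2', 'BC', 'AB', 'ABC'], ordered=True)
cat = pd.Series(['BC', 'transient', 'ABC', 'B', 'AB'], dtype=my_cats)

# Comparisons use the category order: transient < B < BC/2 < BC < AB < ABC
at_least_bc = cat >= 'BC'

# Sorting and min/max respect the same order
ordered = cat.sort_values().tolist()
lowest, highest = cat.min(), cat.max()

print(at_least_bc.tolist())
print(ordered, lowest, highest)
```

With plain strings, 'AB' would sort before 'B'; with the ordered categorical it sorts after 'BC', as declared.
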
    For large DataFrames this is as fast as map alone, and significantly faster than map + fillna followed by a conversion to Categorical, because the result is already a Categorical.

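To check the timing claim on your own machine, a rough benchmark along these lines could be used. The synthetic data, the row count, and the helper names `via_map` and `via_categorical` are assumptions made for this sketch; results will vary with the pandas version and the data.

```python
import timeit

import numpy as np
import pandas as pd

MODES = [0, 9, 15, 25, 40]
CATS = ['B', 'BC/2', 'BC', 'AB', 'ABC']
FILL = 'transient'
my_cats = pd.CategoricalDtype(categories=[FILL] + CATS, ordered=True)

# Synthetic data (an assumption): integers 0..40 stored as floats
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 41, size=200_000).astype(float))

def via_map(series):
    # Map to strings, fill the gaps, then convert to the ordered categorical
    return series.map(dict(zip(MODES, CATS))).fillna(FILL).astype(my_cats)

def via_categorical(series):
    # Cast to a categorical of raw modes, relabel, then fill the gaps
    mode_dtype = pd.CategoricalDtype(categories=[FILL] + MODES, ordered=True)
    return (series.astype(mode_dtype)
                  .cat.rename_categories([FILL] + CATS)
                  .fillna(FILL))

# Both approaches should label every row identically
same = via_map(s).tolist() == via_categorical(s).tolist()

t_map = timeit.timeit(lambda: via_map(s), number=3)
t_cat = timeit.timeit(lambda: via_categorical(s), number=3)
print(f"identical: {same}, map+astype: {t_map:.3f}s, categorical: {t_cat:.3f}s")
```
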
    [Figure: timing comparison of the approaches]

    Absolute timings: [figure]