I have a large dataframe (millions of rows) and I want to categorize a numerical column into ordered labels, but I want to give names only to some specific values and put all other values into the "transient" category. Using pd.cut looks cumbersome and tricky in some cases.
Is there an efficient way to get the same result as the example code?
EDIT: Note that the final column is an ordered CategoricalDtype, not simply a column of strings, so the question is: is there any way to get the result without first mapping to strings and then converting to categorical?
import pandas as pd
MODES = [0, 9, 15, 25, 40]
CATS = 'B BC/2 BC AB ABC'.split()
# Build categorical type
my_cats = pd.CategoricalDtype(categories=['transient', 'B', 'BC/2', 'BC', 'AB', 'ABC'], ordered=True)
# The data to be categorized
df = pd.DataFrame(data=[-3, 1.99, 0, 3, 9, 12, 15, 17, 24.9999999, 25, 25.000000001, 34, 40-1e-13, 40], columns=['val'])
# Building...
df['cat'] = df['val'].replace(dict(zip(MODES, CATS)))
c_ = ~df['cat'].isin(CATS)
df.loc[c_, 'cat'] = 'transient'
df['cat'] = df['cat'].astype(my_cats)
# What I want
print(df)
Since you want an ordered categorical, just work with CategoricalDtype, but use the MODES as the original categories, then rename_categories with the CATS (and optionally fillna):
fill_value = 'transient'
mode_dtype = pd.CategoricalDtype(categories=[fill_value]+MODES, ordered=True)
df['cat'] = (df['val'].astype(mode_dtype)
                      .cat.rename_categories([fill_value]+CATS)
                      .fillna(fill_value)
            )
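To see why this works, here is a minimal sketch that splits the chain into its three steps on a shortened copy of the data. The names s, step1, step2 and step3 are only illustrative and not part of the original code.

```python
import pandas as pd

MODES = [0, 9, 15, 25, 40]
CATS = 'B BC/2 BC AB ABC'.split()

# A shortened, illustrative sample of the 'val' column
s = pd.Series([-3, 1.99, 0, 9, 40.0], name='val')

# Step 1: cast to a categorical whose categories are the raw mode values.
# Any value that is not one of the categories becomes NaN.
mode_dtype = pd.CategoricalDtype(categories=['transient'] + MODES, ordered=True)
step1 = s.astype(mode_dtype)

# Step 2: rename the categories positionally (0 -> 'B', 9 -> 'BC/2', ...).
# This only touches the category labels, not the row data.
step2 = step1.cat.rename_categories(['transient'] + CATS)

# Step 3: the unmatched rows are still NaN; fill them with the
# already existing 'transient' category.
step3 = step2.fillna('transient')

print(step1.isna().tolist())
print(step3.tolist())
```

Because 'transient' is already one of the categories, the final fillna does not need to add a new category.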
And to generalize and streamline this for arbitrary categories, you can use a custom function:
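def cat_converter(series, modes, cats, fill_value='transient'):
    """Map the values in `modes` to the labels in `cats`, everything else to `fill_value`."""
    # Values of `series` that are not in `modes` become NaN here
    mode_dtype = pd.CategoricalDtype(categories=[fill_value]+modes, ordered=True)
    return (series.astype(mode_dtype)
                  .cat.rename_categories([fill_value]+cats)
                  .fillna(fill_value)
            )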
df['cat'] = df['val'].pipe(cat_converter, MODES, CATS)
# or
# df['cat'] = cat_converter(df['val'], MODES, CATS)
Output:
      val        cat
0   -3.00  transient
1    1.99  transient
2    0.00          B
3    3.00  transient
4    9.00       BC/2
5   12.00  transient
6   15.00         BC
7   17.00  transient
8   25.00  transient
9   25.00         AB
10  25.00  transient
11  34.00  transient
12  40.00  transient
13  40.00        ABC
And you directly get an ordered Categorical:
df['cat'].dtype
# CategoricalDtype(categories=['transient', 'B', 'BC/2', 'BC', 'AB', 'ABC'], ordered=True, categories_dtype=object)
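As a small illustration of what the ordering gives you, the sketch below builds a short Series with the same dtype (the sample values and the variable names are made up for this example) and uses order-aware operations on it.

```python
import pandas as pd

my_cats = pd.CategoricalDtype(
    categories=['transient', 'B', 'BC/2', 'BC', 'AB', 'ABC'], ordered=True)

# Illustrative sample, not the data from the question
cat = pd.Series(['transient', 'B', 'BC', 'ABC'], dtype=my_cats)

# Comparisons follow the category order, not alphabetical order
at_least_bc = cat >= 'BC'

# min/max are defined for ordered categoricals
highest = cat.max()

# Integer codes reflect the position of each label in the category list
codes = cat.cat.codes

print(at_least_bc.tolist())  # [False, False, True, True]
print(highest)               # ABC
print(codes.tolist())        # [0, 1, 3, 5]
```

None of these would behave this way on a plain string column, where 'AB' would sort before 'B'.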
This is as fast as map for large DataFrames, and significantly faster than further converting to Categorical after map+fillna, since you already get a Categorical.
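If you want to check the timing on your own machine, a rough sketch could look like the following. The random test data, the sample size, and the helper names via_categorical and via_map are assumptions made for this example. The numbers will depend on your pandas version and hardware, so no timings are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

MODES = [0, 9, 15, 25, 40]
CATS = 'B BC/2 BC AB ABC'.split()
my_cats = pd.CategoricalDtype(categories=['transient'] + CATS, ordered=True)

# Synthetic data: floats in [0, 40], some of which hit the modes exactly
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 41, size=200_000).astype(float))

def via_categorical(series):
    # Approach from this answer: categorical first, then rename and fill
    mode_dtype = pd.CategoricalDtype(categories=['transient'] + MODES, ordered=True)
    return (series.astype(mode_dtype)
                  .cat.rename_categories(['transient'] + CATS)
                  .fillna('transient'))

def via_map(series):
    # Baseline: map to strings, fill, then convert to categorical
    return (series.map(dict(zip(MODES, CATS)))
                  .fillna('transient')
                  .astype(my_cats))

res_cat = via_categorical(s)
res_map = via_map(s)

t_cat = timeit.timeit(lambda: via_categorical(s), number=3)
t_map = timeit.timeit(lambda: via_map(s), number=3)
print(f"categorical-first: {t_cat:.3f}s, map+fillna+astype: {t_map:.3f}s")
```

Both helpers should produce the same ordered categorical, so any difference you measure is purely in how the result is built.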