python pandas dataframe

Mapping a column inside a dataframe to a new type with Pandas 2.2.3+


I am used to being able to do things like:

import pandas as pd
df = pd.DataFrame( pd.Categorical(['a','b','b'],['a','b']),columns=['x'])
df.loc[:,'x'] = df['x'].replace({'a':1, 'b':2})

However, with newer pandas versions, this emits a warning:

/tmp/ipykernel_1721527/1018712932.py:4: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '[1, 2, 2]
Categories (2, object): [1, 2]' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  df.loc[:,'x'] = df['x'].replace({'a':1, 'b':2})

The shortest workaround I can think of is:

ncol = df['x'].replace({'a':1, 'b':2}).astype('float')
df['x'] = None
df = df.astype({'x':'float'})
df.loc[:,'x'] = ncol

But this seems far too long and inelegant for what is ostensibly a very simple operation. Am I missing something obvious?


Solution

  • Ironically, the first part of your question was asked just a few minutes ago. When an assignment changes the dtype (and changing the categories changes the dtype), you should not assign through a slice (df.loc[:, 'x'] = ...) but rather recreate the column (df['x'] = ...).

    The second part requires using cat.rename_categories instead of replace, since categories are immutable in a Categorical Series. Alternatively, use map if you are changing all the values and do not want a Categorical:

    df['x'] = df['x'].cat.rename_categories({'a':1, 'b':2})
    
    # or with map
    df['x'] = df['x'].map({'a':1, 'b':2})
    

    Output:

       x
    0  1
    1  2
    2  2
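
If the goal is a plain numeric column (the workaround in the question casts to float), one possible approach is to rename the categories and then cast explicitly with astype. This is only a sketch: the 'float' target mirrors the question, and it assumes every category has a numeric replacement. The cast is spelled out because map on a categorical column may itself return a categorical result when the mapping is one-to-one.

```python
import pandas as pd

df = pd.DataFrame(pd.Categorical(['a', 'b', 'b'], ['a', 'b']), columns=['x'])

# Rename the categories, then cast the result to a non-categorical dtype.
# Assigning with df['x'] = ... replaces the whole column, so its dtype is free to change.
df['x'] = df['x'].cat.rename_categories({'a': 1, 'b': 2}).astype('float')

print(df['x'].dtype)     # dtype of the replaced column
print(df['x'].tolist())  # values after the cast
```

Dropping the astype call keeps the column categorical, with the renamed categories.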
    

    Demonstration that the dtypes are different when the categories change:

    df = pd.DataFrame(pd.Categorical(['a','b','b'], ['a','b']),columns=['x'])
    
    df['x'].dtype
    # CategoricalDtype(categories=['a', 'b'], ordered=False, categories_dtype=object)
    
    df['x'].cat.rename_categories({'a':1, 'b':2}).dtype
    # CategoricalDtype(categories=[1, 2], ordered=False, categories_dtype=int64)
    
    df['x'].dtype == df['x'].cat.rename_categories({'a':1, 'b':2}).dtype
    # False
    
    df['x'].dtype == df['x'].cat.rename_categories({'a':'1', 'b':'2'}).dtype
    # False
    
    df['x'].dtype == df['x'].copy().dtype
    # True
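
To see the difference between the two assignment styles directly, the following sketch records the FutureWarning messages that each one emits. The helper names (future_warnings_from, assign_whole_column, assign_through_loc) are made up for illustration, and the except branch assumes that a pandas release which raises instead of warning does so with a TypeError or ValueError. What the loc variant reports depends on the installed pandas version.

```python
import warnings

import pandas as pd


def future_warnings_from(assign):
    """Run `assign` on a fresh frame and return the FutureWarning messages it emits."""
    df = pd.DataFrame(pd.Categorical(['a', 'b', 'b'], ['a', 'b']), columns=['x'])
    new_col = df['x'].cat.rename_categories({'a': 1, 'b': 2})
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        try:
            assign(df, new_col)
        except (TypeError, ValueError) as exc:
            # Assumption: releases that no longer warn raise one of these instead.
            return [f'raised {type(exc).__name__}']
    return [str(w.message) for w in caught if issubclass(w.category, FutureWarning)]


def assign_whole_column(df, new_col):
    # Replaces the column object, so the new dtype is simply adopted.
    df['x'] = new_col


def assign_through_loc(df, new_col):
    # Tries to write into the existing column, whose dtype has different categories.
    df.loc[:, 'x'] = new_col


column_result = future_warnings_from(assign_whole_column)
loc_result = future_warnings_from(assign_through_loc)

print('df[...] =       ', column_result)
print('df.loc[:, ...] =', loc_result)
```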