Tags: python, pandas, categorical-data, one-hot-encoding, dummy-variable

Decide which category to drop in pandas get_dummies()


Let's say I have the following df:

import pandas as pd

data = [{'c1': 'a', 'c2': 'x'}, {'c1': 'b', 'c2': 'y'}, {'c1': 'c', 'c2': 'z'}]
df = pd.DataFrame(data)

Output:

       c1 c2
    0  a  x
    1  b  y
    2  c  z

Now I want to use pd.get_dummies() to one-hot encode the two categorical columns c1 and c2 and drop the first category of each column: pd.get_dummies(df, columns=['c1', 'c2'], drop_first=True). How can I decide which category gets dropped, without knowing the order of the rows? Is there an option I missed?


EDIT: My goal would be, for example, to drop category b from c1 and z from c2.

Desired output:

       a  c  x  y
    0  1  0  1  0
    1  0  0  0  1
    2  0  1  0  0

Solution

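  • One trick is to replace the unwanted values with NaN before encoding. get_dummies() creates no column for missing values by default, so those categories are dropped. Here one value is removed per column: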

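    import numpy as np

    # each column mapped to the value that should not get a dummy column
    d = {'c1': 'b', 'c2': 'z'}
    
    d1 = {k: {v: np.nan} for k, v in d.items()}
    # dtype=int keeps the 0/1 output on pandas >= 2.0, where the default is bool
    out = pd.get_dummies(df.replace(d1), columns=['c1', 'c2'],
                         prefix='', prefix_sep='', dtype=int)
    print(out)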
       a  c  x  y
    0  1  0  1  0
    1  0  0  0  1
    2  0  1  0  0
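
The trick above relies on get_dummies() skipping missing values: with the default dummy_na=False, NaN gets no indicator column. A minimal sketch of that behaviour, assuming the three-row frame from the question (the names masked, default and with_na are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['x', 'y', 'z']})

# Replace 'b' with NaN in a single column
masked = df['c1'].replace({'b': np.nan})

# Default behaviour (dummy_na=False): NaN gets no indicator column
default = pd.get_dummies(masked, dtype=int)

# With dummy_na=True the missing value gets its own indicator column
with_na = pd.get_dummies(masked, dummy_na=True, dtype=int)

print(default.columns.tolist())  # ['a', 'c']
print(len(with_na.columns))      # 3
```

The row that originally held b ends up as all zeros in the default result, which is what a dropped reference category looks like.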
    

    If you need to remove multiple values per column, use lists:

    d = {'c1':['b','c'], 'c2':['z']}
    
    d1 = {k:{x: np.nan for x in v} for k, v in d.items()}
    print (d1)
    {'c1': {'b': nan, 'c': nan}, 'c2': {'z': nan}}
    
    out = pd.get_dummies(df.replace(d1), columns=['c1', 'c2'],
                         prefix='', prefix_sep='', dtype=int)
    print(out)
       a  x  y
    0  1  1  0
    1  0  0  1
    2  0  0  0
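
The single-value and list variants can be folded into one function. This is only a sketch: dummies_without is a hypothetical helper name, not part of pandas, and it keeps the default column prefixes so the result stays unambiguous.

```python
import numpy as np
import pandas as pd

def dummies_without(df, drop, **kwargs):
    """Hypothetical helper: one-hot encode the columns named in `drop`,
    omitting the listed value(s) for each column."""
    # Accept either a single value or a collection of values per column
    mapping = {
        col: {v: np.nan for v in (vals if isinstance(vals, (list, tuple, set)) else [vals])}
        for col, vals in drop.items()
    }
    # Values replaced by NaN get no dummy column (dummy_na defaults to False)
    return pd.get_dummies(df.replace(mapping), columns=list(drop), **kwargs)

df = pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['x', 'y', 'z']})
out = dummies_without(df, {'c1': 'b', 'c2': ['z']}, dtype=int)
print(out.columns.tolist())  # ['c1_a', 'c1_c', 'c2_x', 'c2_y']
```

One caveat of this approach: rows that were already NaN in the input cannot be told apart from rows holding a dropped value, since both become all-zero rows.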
    

    EDIT:

    If the values are unique across columns, it is simpler to drop them in a last step:

    out = (pd.get_dummies(df, columns=['c1', 'c2'], prefix='', prefix_sep='', dtype=int)
             .drop(['b', 'z'], axis=1))
    print(out)
       a  c  x  y
    0  1  0  1  0
    1  0  0  0  1
    2  0  1  0  0
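
When the same value can occur in more than one column, dropping by bare value name becomes ambiguous. Two possible variants are sketched below: keep the default prefixes and drop c1_b and c2_z by full name, or convert each column to a categorical with the unwanted category listed first so that drop_first=True removes exactly that one. The category lists are assumptions that fit the example frame; real data would need its own.

```python
import pandas as pd

df = pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['x', 'y', 'z']})

# Variant 1: keep the default '<column>_<value>' names and drop by full name
out1 = (pd.get_dummies(df, columns=['c1', 'c2'], dtype=int)
          .drop(columns=['c1_b', 'c2_z']))

# Variant 2: list the category to drop first, then let drop_first remove it
cat = df.assign(
    c1=pd.Categorical(df['c1'], categories=['b', 'a', 'c']),
    c2=pd.Categorical(df['c2'], categories=['z', 'x', 'y']),
)
out2 = pd.get_dummies(cat, columns=['c1', 'c2'], drop_first=True, dtype=int)

print(out1.columns.tolist())  # ['c1_a', 'c1_c', 'c2_x', 'c2_y']
print(out2.columns.tolist())  # ['c1_a', 'c1_c', 'c2_x', 'c2_y']
```

The second variant also addresses the original question: drop_first follows the order of the categories, not the order of the rows, so fixing the category order fixes which one is dropped.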