python pandas recode anonymize

What is the most efficient & pythonic way to recode a pandas column?


I'd like to 'anonymize' or 'recode' a column in a pandas DataFrame. What's the most efficient way to do so? I wrote the following, but it seems likely there's a built-in function or better way.

dataset = dataset.sample(frac=1).reset_index(drop=True)  # shuffles rows randomly (helps anonymization, since order could have some meaning); drop=True discards the old index so the original order is not kept as a column

# make dictionary of old and new values
value_replacer = 1
values_dict = {}   
for unique_val in dataset[var].unique():
    values_dict[unique_val] = value_replacer
    value_replacer += 1

# replace old values with new in a single pass
# (replacing one value at a time can overwrite already-recoded values when the column holds integers)
dataset[var] = dataset[var].map(values_dict)

Solution

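  • If I understand correctly, you want to factorize your values. pd.factorize returns a (codes, uniques) pair; the codes start at 0, so adding 1 gives labels that start at 1: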

    dataset[var] = pd.factorize(dataset[var])[0] + 1
    

    Demo:

    In [2]: df
    Out[2]:
       col
    0  aaa
    1  aaa
    2  bbb
    3  ccc
    4  ddd
    5  bbb
    
    In [3]: df['col'] = pd.factorize(df['col'])[0] + 1
    
    In [4]: df
    Out[4]:
       col
    0    1
    1    1
    2    2
    3    3
    4    4
    5    2
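
If the original values need to be recoverable later, the second element returned by pd.factorize can be kept as a lookup table. The following is a minimal sketch, not part of the original answer: the names `shuffled` and `code_to_value` and the fixed `random_state` are assumptions made for illustration. It shuffles first because factorize numbers values in order of first appearance.

```python
import pandas as pd

# Same toy data as the demo above.
df = pd.DataFrame({"col": ["aaa", "aaa", "bbb", "ccc", "ddd", "bbb"]})

# Shuffle rows before factorizing so the codes do not reflect the original
# order. random_state is fixed here only to make the sketch reproducible;
# drop=True discards the old index.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

# factorize returns 0-based integer codes plus the unique values in order
# of first appearance.
codes, uniques = pd.factorize(shuffled["col"])
shuffled["col"] = codes + 1  # shift to 1-based labels

# Hypothetical lookup table from each code back to its original value.
# Store it separately from the data if the recoding must stay anonymous.
code_to_value = dict(enumerate(uniques, start=1))
```

Note that by default pd.factorize codes missing values as -1, so they would become 0 after the `+ 1` shift.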