pythonpandasdataframedictionarylines-of-code

Create Unique ID by combining First Name, Middle Name and Date of Birth


I have the following dataframe of names and date of birth. The data type is all over the place, missing values and so on, like this one:

Employee type   First Name  Middle Name Last Name   Date of Birth
Employee    Paulo       Cortez  01-01-90
Employee    Paulo   Ricardo Cortez  01-01-90
Employee    Paulo       Cortez  01-01-90
Employee    Paulo       Cortez  02-01-90
Employee            Cortez  
Employee    Paulo       Cortez  $
Employee    Maria   ##  Silva   02-01-90
Employee    o,89    Pedro       s
Employee    Maria       Silva   
Employee    Maria       Silva   02-01-90
Employee    Joao        Augusto 02-01-90
Employee    Maria       Silva   

I need a way to create an ID column, taking whatever value I have in the combination of names + date of birth and create an unique ID per each of the rows like this one:

Employee type   First Name  Middle Name Last Name   Date of Birth   ID
Employee    Paulo       Cortez  01-01-90    10000
Employee    Paulo   Ricardo Cortez  01-01-90    10001
Employee    Paulo       Cortez  01-01-90    10000
Employee    Paulo       Cortez  02-01-90    10002
Employee            Cortez      10003
Employee    Paulo       Cortez  $   10004
Employee    Maria   ##  Silva   02-01-90    10005
Employee    o,89    Pedro       s   10006
Employee    Maria       Silva       10007
Employee    Maria       Silva   02-01-90    10008
Employee    Joao        Augusto 02-01-90    10009
Employee    Maria       Silva       10007

The ID should start in 10000 What would be the best way to create this column?


Solution

  • df
    ###
      Employee type First Name Last Name Date of Birth
    0      Employee      Paulo    Cortez      01-01-90
    1      Employee      Paulo    Cortez      01-01-90
    2      Employee      Paulo    Cortez      01-01-90
    3      Employee      Paulo       NaN      01-01-90
    4      Employee      Maria     Silva      02-01-90
    5      Employee        NaN     Silva      04-10-90
    6      Employee       Joao   Augusto      12-11-89
    

    Here you can see the difference between cumsum() and ngroup().

    The numbers given to the groups match the order in which the groups would be seen when iterating over the groupby object, not the order they are first observed.

    df['ID'] = (df.groupby(['First Name', 'Last Name', 'Date of Birth'] ,dropna=False).cumcount() == 0).cumsum() + 9999
    df['ID_ngroup'] = df.groupby(['First Name', 'Last Name', 'Date of Birth'] ,sort=False ,dropna=False).ngroup() + 10000
    print(df)
    ###
      Employee type First Name Last Name Date of Birth     ID  ID_ngroup
    0      Employee      Paulo    Cortez      01-01-90  10000      10000
    1      Employee      Paulo    Cortez      01-01-90  10000      10000
    2      Employee      Paulo    Cortez      01-01-90  10000      10000
    3      Employee      Paulo       NaN      01-01-90  10001      10001
    4      Employee      Maria     Silva      02-01-90  10002      10002
    5      Employee        NaN     Silva      04-10-90  10003      10003
    6      Employee       Joao   Augusto      12-11-89  10004      10004