I have the following dataframe of names and date of birth. The data type is all over the place, missing values and so on, like this one:
Employee type First Name Middle Name Last Name Date of Birth
Employee Paulo Cortez 01-01-90
Employee Paulo Ricardo Cortez 01-01-90
Employee Paulo Cortez 01-01-90
Employee Paulo Cortez 02-01-90
Employee Cortez
Employee Paulo Cortez $
Employee Maria ## Silva 02-01-90
Employee o,89 Pedro s
Employee Maria Silva
Employee Maria Silva 02-01-90
Employee Joao Augusto 02-01-90
Employee Maria Silva
I need a way to create an ID column, taking whatever value I have in the combination of names + date of birth and create an unique ID per each of the rows like this one:
Employee type First Name Middle Name Last Name Date of Birth ID
Employee Paulo Cortez 01-01-90 10000
Employee Paulo Ricardo Cortez 01-01-90 10001
Employee Paulo Cortez 01-01-90 10000
Employee Paulo Cortez 02-01-90 10002
Employee Cortez 10003
Employee Paulo Cortez $ 10004
Employee Maria ## Silva 02-01-90 10005
Employee o,89 Pedro s 10006
Employee Maria Silva 10007
Employee Maria Silva 02-01-90 10008
Employee Joao Augusto 02-01-90 10009
Employee Maria Silva 10007
The ID should start in 10000 What would be the best way to create this column?
df
###
Employee type First Name Last Name Date of Birth
0 Employee Paulo Cortez 01-01-90
1 Employee Paulo Cortez 01-01-90
2 Employee Paulo Cortez 01-01-90
3 Employee Paulo NaN 01-01-90
4 Employee Maria Silva 02-01-90
5 Employee NaN Silva 04-10-90
6 Employee Joao Augusto 12-11-89
Here you can see the difference between cumsum()
and ngroup()
.
Use dropna=False
, otherwise groups within na
value would fall into the same group(ID
).
Utilize sort=False
within ngroup()
method, otherwise
The numbers given to the groups match the order in which the groups would be seen when iterating over the groupby object, not the order they are first observed.
df['ID'] = (df.groupby(['First Name', 'Last Name', 'Date of Birth'] ,dropna=False).cumcount() == 0).cumsum() + 9999
df['ID_ngroup'] = df.groupby(['First Name', 'Last Name', 'Date of Birth'] ,sort=False ,dropna=False).ngroup() + 10000
print(df)
###
Employee type First Name Last Name Date of Birth ID ID_ngroup
0 Employee Paulo Cortez 01-01-90 10000 10000
1 Employee Paulo Cortez 01-01-90 10000 10000
2 Employee Paulo Cortez 01-01-90 10000 10000
3 Employee Paulo NaN 01-01-90 10001 10001
4 Employee Maria Silva 02-01-90 10002 10002
5 Employee NaN Silva 04-10-90 10003 10003
6 Employee Joao Augusto 12-11-89 10004 10004