I have a dataframe with personal information and duplicated people, represented by the duplicated ID and their job (if the person appears more than once they may have different jobs represented in the data). Like in this example:
ID Name Street Birth Job Approved?
1 Paulo Av. 15, 7 1989-14-07 Carpinter Yes
2 Maria Stret Sky 2 1989-01-01 Driver Yes
3 John Stret Sky 3 2000-01-02 Mechanic Yes
1 Paulo Av. 15, 7 1989-14-07 Mechanic Yes
2 Maria Stret Sky 2 1989-01-01 Carpinter Yes
4 Peter Ocen Avenue 2001-01-03 Soldier Yes
I need a way to find a way to mask the columns "Name, Street, Birth, Job" using the same data to mask each column if the "ID" is the same. The values doesn't really need to represent anything, just be on the same data type as the column (for example, a real date of birth). This way resulting in a data like this:
ID Name Street Birth Job Approved?
1 Fake1 Street1 2000-01-01 Job1 Yes
2 Fake2 Street2 2000-01-02 Job2 Yes
3 Fake3 Street3 2000-01-03 Job3 Yes
1 Fake1 Street1 2000-01-01 Job1 Yes
2 Fake2 Street2 2000-01-02 Job2 Yes
4 Fake4 Street4 2000-01-03 Job4 Yes
What would be the best way to do it? Thank you!
Use custom function with factorize
with if
statement for datetime
s and different substring for name
s:
cols = ['Name','Street','Birth','Job']
def f(x):
if x.name =='Birth':
return pd.to_datetime(pd.factorize(x)[0], origin='2000-01-01', unit='d')
if x.name =='Name':
x.name = 'Fake'
return x.name + pd.Series(pd.factorize(x)[0] + 1, index=x.index).astype(str)
df[cols] = df[cols].apply(f)
print (df)
ID Name Street Birth Job Approved?
0 1 Fake1 Street1 2000-01-01 Job1 Yes
1 2 Fake2 Street2 2000-01-02 Job2 Yes
2 3 Fake3 Street3 2000-01-03 Job3 Yes
3 1 Fake1 Street1 2000-01-01 Job3 Yes
4 2 Fake2 Street2 2000-01-02 Job1 Yes
5 4 Fake4 Street4 2000-01-04 Job4 Yes