pythonpandasreplacemaskdummy-data

Mask data using same values for duplicates pandas dataframe


I have a dataframe with personal information and duplicated people, represented by the duplicated ID and their job (if the person appears more than once they may have different jobs represented in the data). Like in this example:

ID  Name    Street      Birth       Job         Approved?
1   Paulo   Av. 15, 7   1989-14-07  Carpinter   Yes
2   Maria   Stret Sky 2 1989-01-01  Driver      Yes
3   John    Stret Sky 3 2000-01-02  Mechanic    Yes
1   Paulo   Av. 15, 7   1989-14-07  Mechanic    Yes
2   Maria   Stret Sky 2 1989-01-01  Carpinter   Yes
4   Peter   Ocen Avenue 2001-01-03  Soldier     Yes

I need a way to find a way to mask the columns "Name, Street, Birth, Job" using the same data to mask each column if the "ID" is the same. The values doesn't really need to represent anything, just be on the same data type as the column (for example, a real date of birth). This way resulting in a data like this:

ID  Name    Street  Birth       Job     Approved?
1   Fake1   Street1 2000-01-01  Job1    Yes
2   Fake2   Street2 2000-01-02  Job2    Yes
3   Fake3   Street3 2000-01-03  Job3    Yes
1   Fake1   Street1 2000-01-01  Job1    Yes
2   Fake2   Street2 2000-01-02  Job2    Yes
4   Fake4   Street4 2000-01-03  Job4    Yes

What would be the best way to do it? Thank you!


Solution

  • Use custom function with factorize with if statement for datetimes and different substring for names:

    cols = ['Name','Street','Birth','Job']
    def f(x):
        if x.name =='Birth':
            return pd.to_datetime(pd.factorize(x)[0], origin='2000-01-01', unit='d')
        if x.name =='Name':
            x.name = 'Fake'
        return x.name + pd.Series(pd.factorize(x)[0] + 1, index=x.index).astype(str)
    
    df[cols] = df[cols].apply(f)
    print (df)
       ID   Name   Street      Birth   Job Approved?
    0   1  Fake1  Street1 2000-01-01  Job1       Yes
    1   2  Fake2  Street2 2000-01-02  Job2       Yes
    2   3  Fake3  Street3 2000-01-03  Job3       Yes
    3   1  Fake1  Street1 2000-01-01  Job3       Yes
    4   2  Fake2  Street2 2000-01-02  Job1       Yes
    5   4  Fake4  Street4 2000-01-04  Job4       Yes