pythonpandasdatatablesanonymous-typesscramble

Anonymise data in python across multiple tables and keep the relationships between related columns post anonymisation?


I am working on a project where there are two seperate csv files which I have pulled from a database. I want to load the data in python using pandas and anonymise contents of some of the columns in both tables. Some of these columns that has data that will be anonymised already exist in one of the other tables that shall also be anonymised. But I want them to anonymise to the same thing.

Is there a way to keep the relationships post anonymisation? ie the data in both tables that sit in different columns get anonymised to the same thing?

I have seen many examples online but only for anonymising single tables. How can this be done for two tables and keeping the relationships between the columns so they anonymise to the same thing, as in the example below?

Example:

Both tables, both columns pre anonymisation, but colA in Table 1 is related to colC in Table 2.

Pre anonymisation:

Table 1

colA colB
123456 abcdefg
789123 hijklm

Table 2:

colC colD
123456 xyz123
789123 abc456

Post anonymisation:

Table 1

colA colB
912056 zxcvbn
450912 poiuyt

Table 2:

colC colD
912056 qwe098
450912 asd321

Solution

  • You need to first set a dictionary for replacement. If you'll be using it for changing values in all your tables, you will keep relationships:

    import pandas as pd
    import random
    df = pd.DataFrame({'ID': [1,1,2,3],
                      'value': ['sdfsdfs', 'fhgdhds', 'rgeerfdg', 'regfdd']})
    # create dictionary to replace ID column with
    IDs = df['ID'].unique()
    anon = {}
    for ID in IDs:
        annon[ID] = random.randint(111111,999999)
    df.replace({'ID': anon}, inplace=True)
    df
    
           ID     value
    0  373005   sdfsdfs
    1  373005   fhgdhds
    2  152871  rgeerfdg
    3  934125    regfdd