I am working on a project where there are two seperate csv files which I have pulled from a database. I want to load the data in python using pandas and anonymise contents of some of the columns in both tables. Some of these columns that has data that will be anonymised already exist in one of the other tables that shall also be anonymised. But I want them to anonymise to the same thing.
Is there a way to keep the relationships post anonymisation? ie the data in both tables that sit in different columns get anonymised to the same thing?
I have seen many examples online but only for anonymising single tables. How can this be done for two tables and keeping the relationships between the columns so they anonymise to the same thing, as in the example below?
Both tables, both columns pre anonymisation, but colA in Table 1 is related to colC in Table 2.
Pre anonymisation:
Table 1
colA | colB |
---|---|
123456 | abcdefg |
789123 | hijklm |
Table 2:
colC | colD |
---|---|
123456 | xyz123 |
789123 | abc456 |
Post anonymisation:
Table 1
colA | colB |
---|---|
912056 | zxcvbn |
450912 | poiuyt |
Table 2:
colC | colD |
---|---|
912056 | qwe098 |
450912 | asd321 |
You need to first set a dictionary for replacement. If you'll be using it for changing values in all your tables, you will keep relationships:
import pandas as pd
import random
df = pd.DataFrame({'ID': [1,1,2,3],
'value': ['sdfsdfs', 'fhgdhds', 'rgeerfdg', 'regfdd']})
# create dictionary to replace ID column with
IDs = df['ID'].unique()
anon = {}
for ID in IDs:
annon[ID] = random.randint(111111,999999)
df.replace({'ID': anon}, inplace=True)
df
ID value
0 373005 sdfsdfs
1 373005 fhgdhds
2 152871 rgeerfdg
3 934125 regfdd