I have a dataframe of customer names that I need to use as test data, so the names must be obfuscated. The obfuscation needs to be deterministic: if the same name appears more than once in the table, it should be replaced with the same 'fake' name each time.
For example, both occurrences of Susan H need to get the same fake name:
FullName | FakeName |
---|---|
Susan H | John F |
Eva B | Sarah E |
Susan H | John F |
I have discovered Faker() for this purpose. How can I adapt the below so that I can pass in the existing name as the 'seed_instance' so that the resulting 'fake' name will be the same for all instances of that name in the dataframe?
from faker import Faker
import pyspark.sql.functions as F
fullname_list = [[1,"Sarah Markwaithe"]
,[2,"John Bellamy"]
,[3,"Jordan Fingleberry"]
,[4,"Susan Merchant"]
,[5,"Bobby Franker"]
,[6,"Sally Smith-Holdern"]
,[7,"Finley Farringdon"]
,[8,"Sarah Markwaithe"]
,[9,"Simone Grath"]
,[10,"Frederick Balchum"]
]
df_schema = ["Id","FullName"]
# create example df
df = spark.createDataFrame(fullname_list, df_schema)
fake = Faker('en_GB')
fake_name = F.udf(fake.name)
df = df.withColumn("FakeFullName", fake_name())
df.display()
I understand that I can use seed_instance, but I don't know how to implement it in the code above so that I can pass "FullName" to the UDF (apologies, I'm new to Python and on a tight delivery deadline):
fake.seed_instance("Susan H")
fake.name()
I think I have worked out what to do. I don't know whether it is the right approach (best practice, etc.), so feel free to comment with any other, more efficient or more Pythonic methods:
from faker import Faker
import pyspark.sql.functions as F
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
fullname_list = [[1,"Sarah Markwaithe"]
,[2,"John Bellamy"]
,[3,"Jordan Fingleberry"]
,[4,"Susan Merchant"]
,[5,"Bobby Franker"]
,[6,"Sally Smith-Holdern"]
,[7,"Finley Farringdon"]
,[8,"Sarah Markwaithe"]
,[9,"Simone Grath"]
,[10,"Frederick Balchum"]
]
df_schema = ["Id","FullName"]
# create example df
df = spark.createDataFrame(fullname_list, df_schema)
fake = Faker('en_GB')
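# create a function that seeds Faker with the real name, then returns a fake name
# (parameter renamed from `str` so it no longer shadows the built-in type)
def generate_fake_name(name):
    fake.seed_instance(name)
    return fake.name()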
# Convert to UDF function
fake_name = udf(generate_fake_name, StringType())
# use UDF over dataframe
df = df.withColumn("FakeFullName", fake_name(col("FullName")))
df.show()
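The approach above works because seeding with the real name fixes the sequence of random choices that follow. As far as I know, Faker relies on Python's `random` module for this, so the principle can be sketched with the standard library alone. This is an illustration of the idea, not Faker's implementation: the name pools and the `pseudonym` helper are hypothetical.

```python
import random

# Hypothetical name pools, purely for illustration; Faker ships its own locale data.
FIRST_NAMES = ["John", "Sarah", "Amir", "Priya", "Tom", "Grace"]
LAST_NAMES = ["Fletcher", "Eastwood", "Khan", "Patel", "Harris", "Doyle"]


def pseudonym(real_name: str) -> str:
    """Return the same made-up name every time for the same input string."""
    # A private generator seeded with the input, so no state is shared between calls.
    rng = random.Random(real_name)
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"


names = ["Susan H", "Eva B", "Susan H"]
fakes = [pseudonym(n) for n in names]
print(fakes)  # the first and third entries are identical
```

One caveat that applies to any scheme like this: two different real names can occasionally map to the same fake name, so the fake column should not be treated as a unique key.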
UPDATE: I'm also including this in case it helps someone else trying to achieve the same thing. I only wanted to generate a fake name when the column actually contained a name. To test this, I changed row 3 of the dataframe above from [3,"Jordan Fingleberry"] to [3,""]:
# use UDF over dataframe to overwrite the existing column
# only replace with a fake name if the column to be replaced contains a value
Removed:
from pyspark.sql.functions import when, lit
df = df.withColumn("FullName", when(col("FullName") == "", lit(None)).otherwise(fake_name(col("FullName"))))
df.show()
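An alternative to the `when`/`otherwise` check is to put the guard inside the Python function itself, which also covers `None` values. My understanding is that Spark does not promise the UDF is only invoked for rows that fail the `when` condition, so checking inside the function is the safer option. A minimal sketch, where `skip_blank` and `stand_in` are hypothetical helpers (`stand_in` replaces `generate_fake_name` so the example runs without Spark or Faker), and treating whitespace-only values as blank is my own assumption:

```python
from typing import Callable, Optional


def skip_blank(generate: Callable[[str], str]) -> Callable[[Optional[str]], Optional[str]]:
    """Wrap a name generator so None or blank input yields None instead of a fake name."""
    def wrapper(name: Optional[str]) -> Optional[str]:
        # Assumption: whitespace-only strings count as "no name".
        if name is None or not name.strip():
            return None
        return generate(name)
    return wrapper


# Stand-in generator so the sketch is self-contained;
# in the notebook this would be generate_fake_name.
def stand_in(name: str) -> str:
    return f"fake-{len(name)}"


safe_generate = skip_blank(stand_in)
results = [safe_generate(v) for v in ["Susan Merchant", "", None, "   "]]
print(results)
```

In the notebook, the wrapped function could then be registered in the usual way, for example `udf(skip_blank(generate_fake_name), StringType())`, and applied with a plain `withColumn` call.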