azure, azure-databricks

How to hash users into random values in Azure Databricks


We have an Excel file in the file store; each file contains more than 10,000 columns, and each column holds a JSON document.

For example, a sample is shown below. **Note:** the Excel file in the file store contains more than 10,000 columns of JSON data (roughly 10,000 JSON documents); I need to read the whole Excel file and transform the Users values.

JSON 1:

{"SearchName":"","Id":"","RequestType":"","StartDateUtc":"2022-12-01T00:00:00Z","EndDateUtc":"2023-04-28T00:00:00Z","RecordType":null,"Workload":"","Operations":[],"Users":["d503246e-285c-41bc-8b0a-bc79824146ea,ingrid.van.driel@vroon.nl,ab6019a4c-1e03ee9be97a,bart.van.someren@vroon.nl,85ff-b51c-b88ad4d55b5a,nicky.ongenae@vroon.nl,48168530-6-8985-65f9b0af2b85,erwin.weeda@vroon.nl,0937a1e5-8a68-4573-ae9c-e13f9a2f3617,Thessa.vanden.Oetelaar@vroon.nl,c822dd8b-0b79-4c13-af1e-bc080b8108c5,Hester.Blankenstein@vroon.nl,ca0de5ba-6ab2-4d34-b19d-ca702dcbdb8d,Alvin.Baltado@ph.vroonshipmanagement.com"],"ObjectIds":[],"IPAddresses":[],"SiteIds":null,"AssociatedAdminUnits":[],"FreeText":"multifactor","ResultSize":0,"TimeoutInSeconds":345600,"ScopedAdminWithoutAdminUnits":false}

JSON 2:

{"SearchName":"xiong.jie.wu2"",""Id":"6797200c-4-a40c-6d8cfe7d6c16"",""RequestType":"AuditSearch"",""StartDateUtc":"2023-12-01T00:00:00Z"",""EndDateUtc":"2024-01-26T00:00:00Z"",""RecordType":null",""RecordTypes":[]",""Workload":null",""Workloads":[]",""WorkloadsToInclude":null",""WorkloadsToExclude":null",""ScopedWorkloadSearchEnabled":false",""Operations":["copy"",""harddelete"",""movetodeleteditems"",""move"",""softdelete"",""new-inboxrule"",""set-inboxrule"",""updateinboxrules"",""add-mailboxpermission"",""addfolderpermissions"",""modifyfolderpermissions"]",""Users":["xiong@contoso.com"]",""ObjectIds":[]",""IPAddresses":[]",""SiteIds":null",""AssociatedAdminUnits":[]",""FreeText":""",""ResultSize":0",""TimeoutInSeconds":345600",""ScopedAdminWithoutAdminUnits":false}

… and so on: roughly 10,000 JSON values like these in one Excel file.

We just want to replace the user hash values with normal masked values,
for example turning Alvin.Baltado@ph.vroonshipmanagement.com into sam@contoso.com, for the Users across the whole Excel file.

Every time, we manually copy the Users data and mask it as below, which takes a lot of time. Then we replace the hash values with whatever output we get.

import random

# Pool of mask addresses to draw replacements from
main=['nzn1@contoso.com', 'oman2@contoso.com', 'oman3@contoso.com', 'oman4@contoso.com', 'oman5@contoso.com', 'oman6@contoso.com', 'oman7@contoso.com', 'oman8@contoso.com', 'oman9@contoso.com', 'omaz1@contoso.com', 'omaz2@contoso.com', 'omaz3@contoso.com', 'omaz4@contoso.com', 'omaz5@contoso.com', 'omaom6@contoso.com', 'omax7@contoso.com', 'omaz8@contoso.com', 'omaz9@contoso.com', 'omay1@contoso.com', 'omay2@contoso.com', 'omaom3@contoso.com', 'omax4@contoso.com', 'omax5@contoso.com', 'omax6@contoso.com', 'omaom7@contoso.com', 'omaw8@contoso.com', 'omaw9@contoso.com', 'omae1@contoso.com', 'omae2@contoso.com', 'omae3@contoso.com', 'omae4@contoso.com', 'omae5@contoso.com']

# Users value copied out of one JSON (a single comma-joined string of GUIDs and email addresses)
l=["0e07209b-807b-4938-8bfd-f87cee98e924,invoices@it.vroonoffshore.com,c747a82c-656e-40eb-9194-88c4a0f8061e"]
n=len(l)
print(n)
# Pick n random mask addresses, then paste them back over the original Users values by hand
print(random.sample(main, n))
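
Note that l above holds a single comma-joined string, so n is 1. If the goal is one random mask per GUID/email inside that string, a minimal sketch (splitting the copied value first and reusing the main pool above; the variable names are only illustrative):

users_string = "0e07209b-807b-4938-8bfd-f87cee98e924,invoices@it.vroonoffshore.com,c747a82c-656e-40eb-9194-88c4a0f8061e"
values = users_string.split(',')            # individual GUIDs and email addresses
print(len(values))                          # 3
print(random.sample(main, len(values)))     # one random mask address per value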

My question: is there any way in Azure Databricks to replace the hash values under the Users key of every JSON in the whole Excel file with random users such as ss@contoso.com in one go, and write the result back to a particular location?


Solution

  • I have tried the below approach. First, install the required libraries:

    %pip install openpyxl
    %pip install xlrd
    

    The code below masks the email addresses within the JSON data stored in the Excel file:

    import pandas as pd
    import json

    # Copy the Excel file from DBFS to the driver's local filesystem so pandas can read it
    dbfs_input_path = 'dbfs:/FileStore/tables/sample_excel_file.xlsx'
    local_input_path = '/tmp/sample_excel_file.xlsx'
    dbutils.fs.cp(dbfs_input_path, 'file:' + local_input_path)

    df = pd.read_excel(local_input_path)

    def mask_emails(user_list):
        # Replace every entry in the Users list with the mask address
        return ["sam@contoso.com" for _ in user_list]

    # Walk every cell, parse the JSON, mask the Users list, and write the JSON back
    for col in df.columns:
        for idx, json_data in df[col].items():
            try:
                json_obj = json.loads(json_data)
                if 'Users' in json_obj:
                    json_obj['Users'] = mask_emails(json_obj['Users'])
                df.at[idx, col] = json.dumps(json_obj)
            except (json.JSONDecodeError, TypeError):
                # Skip cells that are empty or do not contain valid JSON
                continue

    # Write the transformed DataFrame locally, then copy it back to DBFS
    local_output_path = '/tmp/transformed_excel_file.xlsx'
    df.to_excel(local_output_path, index=False)
    dbfs_output_path = 'dbfs:/FileStore/tables/transformed_excel_file.xlsx'
    dbutils.fs.cp('file:' + local_output_path, dbfs_output_path)
    print(f"Transformed file saved to {dbfs_output_path}")

    
    Transformed file saved to dbfs:/FileStore/tables/transformed_excel_file.xlsx
    

    In the above code, dbutils.fs.cp is used to copy files between DBFS and the driver's local filesystem: the workbook is stored locally while pandas processes it, and the transformed file is copied back to DBFS afterwards.
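
    If you want each original user replaced with a different random mask address (closer to your manual random.sample approach) rather than the single placeholder sam@contoso.com, you can swap in a mapping-based masker. A minimal sketch, assuming a hypothetical generated pool of mask addresses; use it in the loop above in place of mask_emails:

    import random

    # Hypothetical pool of placeholder addresses; any list of mask emails works here
    mask_pool = [f"user{i}@contoso.com" for i in range(1, 20001)]
    random.shuffle(mask_pool)

    mask_map = {}  # original user value -> assigned random mask address

    def mask_emails_random(user_list):
        masked = []
        for user in user_list:
            # Give each distinct original value its own random mask address,
            # so the same user is masked consistently across all the JSONs
            if user not in mask_map:
                mask_map[user] = mask_pool.pop()
            masked.append(mask_map[user])
        return masked

    Note that in your first sample the Users list holds one comma-joined string of GUIDs and emails; if you need one mask per individual email, split each entry on ',' before mapping.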

    Results:

    # Copy the transformed file back to the driver's local filesystem and preview it
    dbfs_output_path = 'dbfs:/FileStore/tables/transformed_excel_file.xlsx'
    local_output_path = '/tmp/transformed_excel_file.xlsx'
    dbutils.fs.cp(dbfs_output_path, 'file:' + local_output_path)
    df_transformed = pd.read_excel(local_output_path)
    display(df_transformed)
    
    Column1:
    {"SearchName": "", "Id": "", "RequestType": "", "StartDateUtc": "2022-12-01T00:00:00Z", "EndDateUtc": "2023-04-28T00:00:00Z", "RecordType": null, "Workload": "", "Operations": [], "Users": ["sam@contoso.com"], "ObjectIds": [], "IPAddresses": [], "SiteIds": null, "AssociatedAdminUnits": [], "FreeText": "multifactor", "ResultSize": 0, "TimeoutInSeconds": 345600, "ScopedAdminWithoutAdminUnits": false}

    Column2:
    {"SearchName": "xiong.jie.wu2", "Id": "6797200c-4-a40c-6d8cfe7d6c16", "RequestType": "AuditSearch", "StartDateUtc": "2023-12-01T00:00:00Z", "EndDateUtc": "2024-01-26T00:00:00Z", "RecordType": null, "RecordTypes": [], "Workload": null, "Workloads": [], "WorkloadsToInclude": null, "WorkloadsToExclude": null, "ScopedWorkloadSearchEnabled": false, "Operations": ["copy", "harddelete", "movetodeleteditems", "move", "softdelete", "new-inboxrule", "set-inboxrule", "updateinboxrules", "add-mailboxpermission", "addfolderpermissions", "modifyfolderpermissions"], "Users": ["sam@contoso.com"], "ObjectIds": [], "IPAddresses": [], "SiteIds": null, "AssociatedAdminUnits": [], "FreeText": "", "ResultSize": 0, "TimeoutInSeconds": 345600, "ScopedAdminWithoutAdminUnits": false}