
Masking and unmasking a string in Python


I have Python code to generate SQL queries from English queries. But while predicting, I might have to send sensitive data in my English query to the model. I want to mask sensitive information like nouns and numbers in my English query. When I receive the predicted query, I want to unmask that data again.

In short, I need a Python program that can mask the nouns and numbers in my string and then unmask them whenever I want. The masked values can be replaced with anything you suggest.

Sample English Query:

How many Chocolate Orders for a customer with ID 123456?

Masking Expected Output:

How many xxxxxxxxxx Orders for a customer with ID xxxxxxxxx? 

My algorithm will create a query like:

Select count(1) from `sample-bucket` as d where d.Type ='xxxxxxxx' and d.CustId = 'xxxxxxx'

Now I need the unmasked query like below:

Select count(1) from `sample-bucket` as d where d.Type ='Chocolate' and d.CustId = '123456'

Solution

  • You can use the code below to mask and unmask a string. It stores each masked word in a dictionary, so the original words can be restored later when you unmask the string. Note that Base64 is an encoding, not encryption, so it only obscures the values. This can be helpful for people sending text to third-party tools.

    import base64
    import nltk

    # Model used by nltk.pos_tag (newer NLTK releases may need
    # 'averaged_perceptron_tagger_eng' instead).
    nltk.download('averaged_perceptron_tagger')

    NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}

    def base_64_encoding(text):
        return base64.b64encode(text.encode("utf-8")).decode("utf-8")

    def base_64_decoding(text):
        return base64.b64decode(text.encode("utf-8")).decode("utf-8")

    masked_element = {}  # original word -> Base64 token
    english_query = "How many Chocolate Orders for a customer with ID 123456?"
    print("English Query: ", english_query)
    for word in english_query.split(" "):
        # Drop surrounding punctuation so "123456?" is recognised as a number.
        word = word.strip("?.,!")
        if not word:
            continue
        # Each word is tagged on its own, so the tagger sees no sentence context.
        tag = nltk.pos_tag([word])[0][1]
        if tag in NOUN_TAGS or word.isdigit():
            masked_element[word] = base_64_encoding(word)
            # str.replace swaps every occurrence, including inside longer words.
            english_query = english_query.replace(word, masked_element[word])
    print("Masked Query: ", english_query)

    # Unmask by swapping each Base64 token back to the original word.
    for key, val in masked_element.items():
        if val in english_query:
            english_query = english_query.replace(val, key)
    print("Unmasked English Query: ", english_query)
    

    Below is the output of the above program: [image: program output]
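The loop above restores the English query, whereas the question asks for the generated SQL to be unmasked. A minimal sketch of that step follows, assuming the model copies the masked tokens into its output unchanged. The names `mask_terms` and `unmask` and the `__MASK_n__` placeholder format are hypothetical, and the list of sensitive terms is passed in by hand here (in practice it could come from the POS-tagging step).

```python
import re

def mask_terms(query, terms):
    """Replace each sensitive term with a unique placeholder.

    Returns the masked query and a dict mapping placeholder -> original term.
    """
    mapping = {}
    masked = query
    for i, term in enumerate(terms):
        placeholder = f"__MASK_{i}__"  # hypothetical placeholder format
        # \b limits the match to whole words, so substrings of longer
        # words are left alone.
        pattern = r"\b" + re.escape(term) + r"\b"
        masked, count = re.subn(pattern, placeholder, masked)
        if count:
            mapping[placeholder] = term
    return masked, mapping

def unmask(text, mapping):
    """Swap every placeholder found in text back to its original term."""
    if not mapping:
        return text
    pattern = re.compile("|".join(re.escape(p) for p in mapping))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

english_query = "How many Chocolate Orders for a customer with ID 123456?"
masked_query, mapping = mask_terms(english_query, ["Chocolate", "123456"])

# Stand-in for the model's prediction on the masked query (assumed output).
predicted_sql = ("Select count(1) from `sample-bucket` as d "
                 "where d.Type ='__MASK_0__' and d.CustId = '__MASK_1__'")
unmasked_sql = unmask(predicted_sql, mapping)
print(masked_query)
print(unmasked_sql)
```

Unique placeholders are used instead of a fixed run of `x` characters because identical masks cannot be mapped back to different original values.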