I have Python code to generate SQL queries from English queries. But while predicting, I might have to send sensitive data in my English query to the model. I want to mask sensitive information like nouns and numbers in my English query. When I receive the predicted query, I want to unmask that data again.
In short, I need a Python program that can mask nouns and numbers in a string and then unmask them whenever I need to. The masked values can be replaced with anything you suggest.
Sample English Query:
How many Chocolate Orders for a customer with ID 123456?
Masking Expected Output:
How many xxxxxxxxxx Orders for a customer with ID xxxxxxxxx?
My algorithm will create a query like:
Select count(1) from `sample-bucket` as d where d.Type ='xxxxxxxx' and d.CustId = 'xxxxxxx'
Now I need the unmasked query like below:
Select count(1) from `sample-bucket` as d where d.Type ='Chocolate' and d.CustId = '123456'
You can use the code below to mask and unmask a string. It keeps every masked word in a dictionary, so you can restore the original values later whenever you need to unmask. Note that Base64 is an encoding, not encryption: anyone who sees the masked text can decode it, so treat it as obfuscation only. This should be helpful for people sending text to third-party tools.
import base64
import re

import nltk

# Tagger model used by nltk.pos_tag; newer NLTK releases may ask for
# 'averaged_perceptron_tagger_eng' instead.
nltk.download('averaged_perceptron_tagger')

def base_64_encoding(text):
    return base64.b64encode(text.encode("utf-8")).decode("utf-8")

def base_64_decoding(text):
    return base64.b64decode(text.encode("utf-8")).decode("utf-8")

masked_element = {}  # original word -> Base64-masked word
english_query = "How many Chocolate Orders for a customer with ID 123456?"
print("English Query: ", english_query)

# Split into words and punctuation so "123456?" yields "123456", which isdigit() accepts.
tokens = re.findall(r"\w+|[^\w\s]", english_query)
# Tag the whole sentence at once so the tagger can use context.
for word, tag in nltk.pos_tag(tokens):
    if tag in ('NN', 'NNS', 'NNP', 'NNPS') or word.isdigit():
        masked_element[word] = base_64_encoding(word)

# Mask in a single pass, matching whole words only, so a short word is never
# replaced inside a longer one or inside an already masked value.
if masked_element:
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in masked_element) + r")\b")
    english_query = pattern.sub(lambda m: masked_element[m.group(0)], english_query)
print("Masked Query: ", english_query)

for key, val in masked_element.items():
    if val in english_query:
        english_query = english_query.replace(val, key)
print("Unmasked English Query: ", english_query)
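A variation on the approach above, in case Base64 being reversible is a concern: replace each sensitive value with an opaque numbered placeholder and keep the originals only on your side. This is a minimal sketch, not a definitive implementation. The names `mask`, `unmask` and the `MASK_<n>` placeholder format are hypothetical, the list of sensitive words is passed in by the caller (it could come from the POS tagger), and it assumes the model copies placeholders into the generated SQL unchanged.

```python
import re

PLACEHOLDER = re.compile(r"MASK_(\d+)")


def mask(text, sensitive_words=()):
    """Replace digit runs and the given words with MASK_<n> placeholders."""
    values = []  # values[n] is the original text behind MASK_<n>
    index = {}   # original text -> n, so repeated values share one placeholder

    alternatives = [re.escape(w) for w in sorted(sensitive_words, key=len, reverse=True)]
    alternatives.append(r"\d+")
    pattern = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")

    def substitute(match):
        original = match.group(0)
        if original not in index:
            index[original] = len(values)
            values.append(original)
        return "MASK_{}".format(index[original])

    return pattern.sub(substitute, text), values


def unmask(text, values):
    """Put the original values back wherever a known MASK_<n> appears."""
    def restore(match):
        n = int(match.group(1))
        # Leave unknown placeholders untouched instead of failing.
        return values[n] if n < len(values) else match.group(0)

    return PLACEHOLDER.sub(restore, text)


query = "How many Chocolate Orders for a customer with ID 123456?"
masked_query, values = mask(query, sensitive_words=["Chocolate"])
print(masked_query)

# Stand-in for the model's output; it is assumed to copy placeholders verbatim.
predicted_sql = ("Select count(1) from `sample-bucket` as d "
                 "where d.Type ='MASK_0' and d.CustId = 'MASK_1'")
print(unmask(predicted_sql, values))
```

Because unmasking matches the full `MASK_<n>` token in one pass, `MASK_1` is never confused with `MASK_10`, and nothing about the original value can be recovered from the placeholder alone.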