Description of problem: I'm working with a sensitive data set (one of its columns contains people's telephone numbers). I need to apply an encryption or hash function to the phone numbers to convert them to encoded values before doing my analysis.
It can be a one-way hash, i.e., after processing the encoded data we won't be converting the values back to the original phone numbers. Essentially, I am looking for an anonymizer that takes phone numbers and converts them to some random value on which I can do my processing. Please suggest the best way to go about this.
My dataset is hundreds of GB in size.
Because they qualify as sensitive data, phone numbers should not be part of our analysis. So, basically, I need a one-way hashing function without collisions: each phone number should map to a unique value, and no two phone numbers should map to the same value.
How can I handle this?
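Generate a random key for your data set (16 or 32 bytes) and keep it secret. Apply HMAC-SHA1 to each phone number with this key and Base64-encode the result; you then have a random-looking string per phone number that isn't reversible (without the key). Collisions are not strictly impossible, but with a 160-bit output they are so unlikely that each phone number effectively maps to a unique value.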
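For illustration, here is a minimal sketch of that approach using only Python's standard library (hmac, hashlib, base64, os). The function names generate_key and anonymize_hmac are hypothetical, and the sketch assumes phone numbers arrive as strings that have already been normalized to one format, since two spellings of the same number would otherwise produce different values.

```python
import base64
import hashlib
import hmac
import os


def generate_key(num_bytes=32):
    # Hypothetical helper: returns a random secret key (32 bytes = 256 bits).
    # Store it securely; anyone holding it can recompute the mapping.
    return os.urandom(num_bytes)


def anonymize_hmac(phone_num, key):
    # HMAC-SHA1 of the phone number under the secret key (20-byte digest).
    digest = hmac.new(key, phone_num.encode("utf-8"), hashlib.sha1).digest()
    # Base64-encode the digest so it can be stored as text.
    return base64.b64encode(digest).decode("ascii")
```

The same key must be reused for the whole data set so that a given phone number always maps to the same value; once the analysis is finished, destroying the key removes the ability to recompute the mapping.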
Example (HMAC-SHA1 with a 256-bit key) using Keyczar:
Create a random secret key:
$> python keyczart.py create --location=path_to_key_set --purpose=sign
$> python keyczart.py addkey --location=path_to_key_set --status=primary
Anonymize a phone number:
from keyczar import keyczar

def anonymize(phone_num):
    # Load the signing key set created above and sign the phone number with it.
    signer = keyczar.Signer.Read("path_to_key_set")
    return signer.Sign(phone_num)
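Since the data set is hundreds of GB, it probably makes sense to stream it rather than load it into memory. The following is only a sketch under assumptions: it presumes CSV input with a header row, uses a hypothetical column name "phone", and uses the standard library's hmac module in place of Keyczar so that it is self-contained. The function name anonymize_csv is likewise made up for illustration.

```python
import base64
import csv
import hashlib
import hmac


def anonymize_csv(src, dst, key, column="phone"):
    # src and dst are open text file objects; rows are processed one at a
    # time, so memory use stays flat regardless of the input size.
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Replace the phone number with its Base64-encoded HMAC-SHA1 value.
        digest = hmac.new(key, row[column].encode("utf-8"), hashlib.sha1).digest()
        row[column] = base64.b64encode(digest).decode("ascii")
        writer.writerow(row)
```

It could be called as anonymize_csv(open("in.csv", newline=""), open("out.csv", "w", newline=""), key), where the file names are placeholders. Because each row is independent, the input can also be split into chunks and processed in parallel, as long as every worker uses the same key.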