I am working through the Google Cloud DLP API documentation available here; this question is specifically about deidentify_with_fpe().
My question is: what format do the arguments passed to the function need to have for it to return anonymised data? My code at the moment is:
def deidentify_with_fpe(
    string,
    info_types,
    alphabet=1,
    project='XXXX-data-development',
    surrogate_type=None,
    key_name='projects/XXXX-data-development/locations/global/keyRings/google-dlp-test-global/cryptoKeys/google-dlp-test-key-global',
    wrapped_key=WRAPPED
):
    """Uses the Data Loss Prevention API to deidentify sensitive data in a
    string using Format Preserving Encryption (FPE).
    Args:
        project: The Google Cloud project id to use as a parent resource.
        string: The string to deidentify (will be treated as text).
        info_types: A list of info type names to look for, e.g. ['FIRST_NAME'].
        alphabet: The set of characters to replace sensitive ones with. For
            more information, see https://cloud.google.com/dlp/docs/reference/
            rest/v2beta2/organizations.deidentifyTemplates#ffxcommonnativealphabet
        surrogate_type: The name of the surrogate custom info type to use. Only
            necessary if you want to reverse the deidentification process. Can
            be essentially any arbitrary string, as long as it doesn't appear
            in your dataset otherwise.
        key_name: The name of the Cloud KMS key used to encrypt ('wrap') the
            AES-256 key. Example:
            key_name = 'projects/YOUR_GCLOUD_PROJECT/locations/YOUR_LOCATION/
            keyRings/YOUR_KEYRING_NAME/cryptoKeys/YOUR_KEY_NAME'
        wrapped_key: The encrypted ('wrapped') AES-256 key to use. This key
            should be encrypted using the Cloud KMS key specified by key_name.
    Returns:
        None; the response from the API is printed to the terminal.
    """
    # The wrapped key is read in from a file and passed in as wrapped_key.

    # Import the client library
    import google.cloud.dlp_v2

    # Instantiate a client from the service account key file
    dlp = google.cloud.dlp_v2.DlpServiceClient.from_service_account_json(
        '/Users/callumsmyth/virtual_envs/google_dlp_test/XXXX.json'
    )

    # Convert the project id into a full resource id.
    parent = dlp.project_path(project)

    # The wrapped key is base64-encoded, but the library expects a binary
    # string, so decode it here.
    import base64
    # wrapped_key = base64.b64decode(wrapped_key)

    # Construct FPE configuration dictionary
    crypto_replace_ffx_fpe_config = {
        "crypto_key": {
            "kms_wrapped": {
                "wrapped_key": wrapped_key,
                "crypto_key_name": key_name,
            }
        },
        "common_alphabet": alphabet,
    }

    # Add surrogate type
    if surrogate_type:
        crypto_replace_ffx_fpe_config["surrogate_info_type"] = {
            "name": surrogate_type
        }

    # Construct inspect configuration dictionary
    inspect_config = {
        "info_types": [{"name": info_type} for info_type in info_types]
    }

    # Construct deidentify configuration dictionary
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "crypto_replace_ffx_fpe_config": crypto_replace_ffx_fpe_config
                    }
                }
            ]
        }
    }

    # Convert string to item
    item = {"value": string}

    # Call the API
    response = dlp.deidentify_content(
        parent,
        inspect_config=inspect_config,
        deidentify_config=deidentify_config,
        item=item,
    )

    # Print results
    print(response.item.value)
where WRAPPED is read from the encrypted key file:
with open('mysecret.txt.encrypted', 'rb') as f:
    WRAPPED = f.read()
and mysecret.txt.encrypted was generated by this command in the terminal:
--keyring google-dlp-test-global --key google-dlp-test-key-global \
--plaintext-file google-token.txt \
--ciphertext-file mysecret.txt.encrypted
where google-token.txt was generated from here.
The error I get when calling deidentify_with_fpe('My name is john smith', ['FIRST_NAME']) is as follows:
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Could not de-identify all content due to transformation errors. See the error details for an overview of all the transformation errors encountered."
debug_error_string = "{"created":"@1581675678.839972000","description":"Error received from peer ipv4:216.58.213.10:443","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Could not de-identify all content due to transformation errors. See the error details for an overview of all the transformation errors encountered.","grpc_status":3}"
which is a direct cause of:
InvalidArgument: 400 Could not de-identify all content due to transformation errors. See the error details for an overview of all the transformation errors encountered.
So I think my issue is to do with the key before it is encrypted. I cannot see anywhere in the documentation how to source that key, or how to pass it into the function.
I appreciate that this is a lengthy submission, and any response would be appreciated. I've spent too long trying to do this and feel like I'm close to getting it to work.
The error: “google.api_core.exceptions.InvalidArgument: 400 Could not de-identify all content due to transformation errors. See the error details for an overview of all the transformation errors encountered.”
This is a generic error raised when free-form text de-identification fails because of transformation errors. Unfortunately, it seems that the Python library does not expose the error details.
As per the service documentation [1], the detected tokens must be at least two characters long:
The input value:
- Must be at least two characters long (or the empty string).
- Must be encoded as ASCII.
- Must be comprised of the characters specified by an "alphabet," which is the set of between 2 and 64 allowed characters in the input value. For more information, see the alphabet field in CryptoReplaceFfxFpeConfig.
[1] https://cloud.google.com/dlp/docs/transformations-reference#fpe
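
Since the library does not surface the details, it can help to check a suspect token against the three requirements above locally before calling the API. The following is only a rough sketch: the helper name fpe_input_problems and the COMMON_ALPHABETS table are hypothetical, and the mapping of the numeric common_alphabet values to character sets is my reading of the FfxCommonNativeAlphabet enum (1 = NUMERIC, 2 = HEXADECIMAL, 3 = UPPER_CASE_ALPHA_NUMERIC, 4 = ALPHA_NUMERIC), so verify it against the reference for your library version.

```python
import string

# Assumed character sets for the FfxCommonNativeAlphabet enum values.
COMMON_ALPHABETS = {
    1: string.digits,                           # NUMERIC
    2: string.digits + "ABCDEF",                # HEXADECIMAL
    3: string.digits + string.ascii_uppercase,  # UPPER_CASE_ALPHA_NUMERIC
    4: string.digits + string.ascii_letters,    # ALPHA_NUMERIC
}


def fpe_input_problems(value, alphabet=1):
    """Hypothetical helper: list the FPE input rules that `value` breaks."""
    allowed = COMMON_ALPHABETS[alphabet]
    problems = []
    # Rule 1: at least two characters long, or the empty string.
    if len(value) == 1:
        problems.append("must be at least two characters long (or empty)")
    # Rule 2: ASCII only.
    if not all(ord(ch) < 128 for ch in value):
        problems.append("must be encoded as ASCII")
    # Rule 3: every character must belong to the configured alphabet.
    bad = sorted(set(value) - set(allowed))
    if bad:
        problems.append("characters outside the alphabet: " + "".join(bad))
    return problems


if __name__ == "__main__":
    # A name checked against the numeric alphabet, then the alphanumeric one.
    print(fpe_input_problems("john", alphabet=1))
    print(fpe_input_problems("john", alphabet=4))
```

If the mapping above is right, the call in the question uses alphabet=1 (NUMERIC), so a FIRST_NAME finding such as "john" would break the third requirement; passing an alphabet that covers letters may be worth trying.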