java google-cloud-dlp

Google DLP - Can I use a delimiter so that DLP infoType detectors search for sensitive text only inside it?


I have an issue when de-identifying data with DLP. I use an object mapper to serialize an object to a string, send the string to DLP for de-identification, and then use the object mapper to parse the de-identified string back into the original object type. Sometimes DLP returns a string that cannot be parsed back into the original object (it breaks the JSON format the object mapper expects).

I use an object mapper to serialize an Address object to a string. The class looks like this:

data class Address(
    val postal_code: String,
    val street: String,
    val city: String,
    val provence: String
)

My object mapper serializes this object into a string, e.g. "{\"postal_code\":\"123ABC\",\"street\":\"Street Name\",\"city\":\"My City\",\"provence\":\"My Provence\"}", which is sent to DLP and de-identified (using the LOCATION or STREET_ADDRESS detectors).

The issue is that my object mapper expects the de-identified string to keep the same JSON format, so that it can be parsed back into an Address object, e.g. "{\"postal_code\":\"LOCATION_TOKEN(10):asdf\",\"street\":\"LOCATION_TOKEN(10):asdf\",\"city\":\"LOCATION_TOKEN(10):asdf\",\"provence\":\"LOCATION_TOKEN(10):asdf\"}"

But DLP often returns something like "{"LOCATION_TOKEN(25):asdfasdfasdf)\",\"provence\":\"LOCATION_TOKEN(10):asdf\"}" instead. This breaks the JSON format, so I am unable to parse the string from DLP back into my original object.

Is there a way to instruct the DLP infoType detectors to preserve the JSON format, or to look for sensitive text only inside \" * \"?

Thanks


Solution

  • There are some options here that use a custom regex and an inspection rule set to define a boundary on matches.

    The general idea is to require that a finding match both a built-in infoType (e.g. STREET_ADDRESS, LOCATION, PERSON_NAME) and your custom infoType before it is reported or redacted. Requiring both lets you set bounds on where the built-in infoType can detect.

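    Here is an example request. Note that its rule set applies only to LOCATION; list the other infoTypes there as well if they should be bounded too.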

    {
      "item": {
        "value": "{\"postal_code\":\"123ABC\",\"street\":\"Street Name\",\"city\":\"My City\",\"provence\":\"My Provence\"}"
      },
      "inspectConfig": {
        "customInfoTypes": [
          {
            "infoType": {
              "name": "CUSTOM_BLOCK"
            },
            "regex": {
              "pattern": "(:\")([^,]*)(\")",
              "groupIndexes": [
                2
              ]
            },
            "exclusionType": "EXCLUSION_TYPE_EXCLUDE"
          }
        ],
        "infoTypes": [
          {
            "name": "EMAIL_ADDRESS"
          },
          {
            "name": "LOCATION"
          },
          {
            "name": "PERSON_NAME"
          }
        ],
        "ruleSet": [
          {
            "infoTypes": [
              {
                "name": "LOCATION"
              }
            ],
            "rules": [
              {
                "exclusionRule": {
                  "excludeInfoTypes": {
                    "infoTypes": [
                      {
                        "name": "CUSTOM_BLOCK"
                      }
                    ]
                  },
                  "matchingType": "MATCHING_TYPE_INVERSE_MATCH"
                }
              }
            ]
          }
        ]
      },
      "deidentifyConfig": {
        "infoTypeTransformations": {
          "transformations": [
            {
              "primitiveTransformation": {
                "replaceWithInfoTypeConfig": {}
              }
            }
          ]
        }
      }
    }
    
    
    

    Example output:

      "item": {
        "value": "{\"postal_code\":\"123ABC\",\"street\":\"Street Name\",\"city\":\"My City\",\"provence\":\"My [LOCATION]\"}"
      },
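To see which spans the CUSTOM_BLOCK pattern would allow, the regex can be tried locally. The sketch below is only an approximation: DLP evaluates patterns using RE2 syntax, whereas this uses java.util.regex, and the class and method names (CustomBlockRegexDemo, allowedSpans) are made up for illustration. For a pattern this simple the two engines should agree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the DLP client library.
class CustomBlockRegexDemo {
    // Same pattern as the CUSTOM_BLOCK infoType, written as a Java string literal.
    static final Pattern CUSTOM_BLOCK = Pattern.compile("(:\")([^,]*)(\")");

    // Collects group 2 of every match, i.e. the text between :" and the closing ".
    static List<String> allowedSpans(String json) {
        List<String> spans = new ArrayList<>();
        Matcher m = CUSTOM_BLOCK.matcher(json);
        while (m.find()) {
            spans.add(m.group(2));
        }
        return spans;
    }

    public static void main(String[] args) {
        String json = "{\"postal_code\":\"123ABC\",\"street\":\"Street Name\","
                + "\"city\":\"My City\",\"provence\":\"My Provence\"}";
        // prints [123ABC, Street Name, My City, My Provence]
        System.out.println(allowedSpans(json));
    }
}
```

One limitation worth knowing: because [^,]* cannot cross a comma, a value that itself contains a comma (for example "Main St, Apt 4") is not matched at all, so under the rule set above findings inside such a value would presumably be excluded rather than redacted.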
    

    Setting "groupIndexes" to 2 tells DLP that the custom infoType should match only the second (middle) regex group, so the leading :" and the trailing " are not part of the match. The example also marks the custom infoType as EXCLUSION_TYPE_EXCLUDE so that it is not itself reported as a finding:

    "exclusionType": "EXCLUSION_TYPE_EXCLUDE"
    

    If you remove this line, anything matching your custom infoType is also reported and can get redacted. That can be useful for testing, though. Example output:

      "item": {
        "value": "{\"postal_code\":\"[CUSTOM_BLOCK]\",\"street\":\"[CUSTOM_BLOCK]\",\"city\":\"[CUSTOM_BLOCK]\",\"provence\":\"[CUSTOM_BLOCK][LOCATION]\"}"
      },
    ...
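While testing, it can also help to check locally that a response left everything outside the quoted values untouched before handing it to the object mapper. The sketch below is one possible way to do that, reusing the same regex to blank out the values and compare what remains. JsonSkeletonCheck, skeleton and sameStructure are hypothetical names, not part of DLP, and a real JSON parser would be a more robust check (for instance, values or tokens containing a comma are not handled here).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the DLP client library.
class JsonSkeletonCheck {
    // Same pattern as the CUSTOM_BLOCK infoType, written as a Java string literal.
    private static final Pattern VALUE = Pattern.compile("(:\")([^,]*)(\")");

    // Blanks out every value span (group 2), keeping the surrounding :" and ".
    static String skeleton(String json) {
        return VALUE.matcher(json).replaceAll("$1$3");
    }

    // True when everything outside the value spans is identical in both strings.
    static boolean sameStructure(String original, String deidentified) {
        return skeleton(original).equals(skeleton(deidentified));
    }

    public static void main(String[] args) {
        String original = "{\"postal_code\":\"123ABC\",\"street\":\"Street Name\","
                + "\"city\":\"My City\",\"provence\":\"My Provence\"}";
        String bounded = "{\"postal_code\":\"123ABC\",\"street\":\"Street Name\","
                + "\"city\":\"My City\",\"provence\":\"My [LOCATION]\"}";
        String broken = "{\"LOCATION_TOKEN(25):asdfasdfasdf)\","
                + "\"provence\":\"LOCATION_TOKEN(10):asdf\"}";
        System.out.println(sameStructure(original, bounded)); // prints true
        System.out.println(sameStructure(original, broken));  // prints false
    }
}
```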
    

    Hope this helps.