jsonjqpii

How to retain only a set of specified fields in a complex JSON object?


Summary:

I need to filter PII data from a complex JSON object in one-pass with JQ. I'm not in scripting where deconstruct and/or multi-pass. I want to retain non-PII properties rather than remove PII properties because I want to avoid a PII leakage exposure if the backend adds a new PII property and doesn't let me know.

In simple cases, I can easily "reconstruct" the desired JSON object from the input object like this:

{
 "data": {
   "id": "123",
   "pii": "sensitve"
 }
 "return-code": 200
}

jq '{data: {id: .data.id }, return-code: .return-code}'

Once arrays are added to the mix, I'm not seeing how to use this approach to solve this problem.

Simplified Example of the Complex Object

Input:

{
  "customers": [
    {
      "id": "00000000001",
      "dateOfBirth": "sensitive DOB",
      "preferences": [
        {
          "preference-id": "0001",
          "pii-value": "senstive value 1"
        },
        {
          "preference-id": "0002",
          "pii-value": "senstive value 2"
        }
      ]
    },
    {
      "id": "00000000002",
      "dateOfBirth": "sensitive DOB",
      "preferences": [
        {
          "preference-id": "0003",
          "pii-value": "senstive value 3"
        },
        {
          "preference-id": "0004",
          "pii-value": "senstive value 4"
        }
      ]
    }
  ]
}

Desired Output:

{
  "customers": [
    {
      "id": "00000000001",
      "preferences": [
        {
          "preference-id": "0001"
        },
        {
          "preference-id": "0002"
        }
      ]
    },
    {
      "id": "00000000002",
      "preferences": [
        {
          "preference-id": "0003"
        },
        {
          "preference-id": "0004"
        }
      ]
    }
  ]
}

Attempted Approach with Arrays:

jq '{ customers: [ { id: .customers[].id, preferences: [ .customers[].preferences ] } ]}'

Result starts pulling together permutations across different customers

{
  "customers": [
    {
      "id": "00000000001",
      "preferences": [
        [
          {
            "preference-id": "0001",
            "pii-value": "senstive value 1"
          },
          {
            "preference-id": "0002",
            "pii-value": "senstive value 2"
          }
        ],
        [
          {
            "preference-id": "0003",
            "pii-value": "senstive value 3"
          },
          {
            "preference-id": "0004",
            "pii-value": "senstive value 4"
          }
        ]
      ]
    },
    {
      "id": "00000000002",
      "preferences": [
        [
          {
            "preference-id": "0001",
            "pii-value": "senstive value 1"
          },
          {
            "preference-id": "0002",
            "pii-value": "senstive value 2"
          }
        ],
        [
          {
            "preference-id": "0003",
            "pii-value": "senstive value 3"
          },
          {
            "preference-id": "0004",
            "pii-value": "senstive value 4"
          }
        ]
      ]
    }
  ]
}

I really don't think this approach will work at all and I'm at a loss for other approaches. This is a simplified example, the actual JSON is quite large with many arrays at various nested levels.

Any suggestions for what approach I might investigate?


Solution

  • Using a user function for picking specific paths from a JSON:

    def pick(paths):
      . as $in
      | reduce path(paths) as $path (null;
        setpath($path; $in | getpath($path))
      );
    pick(.customers[] | .id, .preferences[]."preference-id")
    

    Online demo