Tags: postgresql, jsonb, postgresql-json

How can I select a filtered PostgreSQL jsonb field with performance as a priority?


A table:

CREATE TABLE events_holder(
    id serial primary key,
    version int not null,
    data jsonb not null
);

The data field can be very large (up to 100 MB) and looks like this:

{
  "id": 5,
  "name": "name5",
  "events": [
    {
      "id": 255,
      "name": "festival",
      "start_date": "2022-04-15",
      "end_date": "2023-04-15",
      "values": [
        {
          "id": 654,
          "type": "text",
          "name": "importance",
          "value": "high"
        },
        {
          "id": 655,
          "type": "boolean",
          "name": "epic",
          "value": "true"
        }
      ]
    },
    {
      "id": 256,
      "name": "discovery",
      "start_date": "2022-02-20",
      "end_date": "2022-02-22",
      "values": [
        {
          "id": 711,
          "type": "text",
          "name": "importance",
          "value": "low"
        },
        {
          "id": 712,
          "type": "boolean",
          "name": "specificAttribute",
          "value": "false"
        }
      ]
    }
  ]
}
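
For reference, a test row can be created like this (the values arrays are abbreviated from the document above):

INSERT INTO events_holder (version, data) VALUES (
  1,
  '{
     "id": 5,
     "name": "name5",
     "events": [
       {"id": 255, "name": "festival",  "start_date": "2022-04-15", "end_date": "2023-04-15",
        "values": [{"id": 654, "type": "text", "name": "importance", "value": "high"}]},
       {"id": 256, "name": "discovery", "start_date": "2022-02-20", "end_date": "2022-02-22",
        "values": [{"id": 711, "type": "text", "name": "importance", "value": "low"}]}
     ]
   }'::jsonb
);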

I want to select the data field by version, but filtered with an extra condition: only events whose end_date > '2022-03-15' should be kept. The output must look like this:

{
  "id": 5,
  "name": "name5",
  "events": [
    {
      "id": 255,
      "name": "festival",
      "start_date": "2022-04-15",
      "end_date": "2023-04-15",
      "values": [
        {
          "id": 654,
          "type": "text",
          "name": "importance",
          "value": "high"
        },
        {
          "id": 655,
          "type": "boolean",
          "name": "epic",
          "value": "true"
        }
      ]
    }
  ]
}

How can I do this with maximum performance? How should I index the data field?

My current solution:

with cte as (
  select eh.id, eh.version, jsonb_agg(events) as filteredEvents
  from events_holder eh
  cross join jsonb_array_elements(eh.data #> '{events}') as events
  where eh.version = 1
    and (events ->> 'end_date')::timestamp >= '2022-03-15'::timestamp
  group by eh.id, eh.version
)
select jsonb_set(eh.data, '{events}', cte.filteredEvents)
from events_holder eh
join cte on eh.id = cte.id;

But I don't think it's a good approach.
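
One caveat I noticed: the cross join drops rows whose events array has no qualifying entries at all. If such rows should still come back with an empty events array, a left join lateral variant (a sketch I haven't tested) would keep them:

-- Keep rows even when no event matches, returning an empty events array for them.
with cte as (
  select eh.id, eh.version,
         coalesce(jsonb_agg(ev) filter (where ev is not null), '[]'::jsonb) as filteredEvents
  from events_holder eh
  left join lateral jsonb_array_elements(eh.data -> 'events') as ev
         on (ev ->> 'end_date')::timestamp >= '2022-03-15'::timestamp
  where eh.version = 1
  group by eh.id, eh.version
)
select jsonb_set(eh.data, '{events}', cte.filteredEvents)
from events_holder eh
join cte on eh.id = cte.id;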


Solution

  • You can do this using a JSON path expression:

    select eh.id, eh.version, 
           jsonb_path_query_array(data, 
                                  '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())')
    from events_holder eh
    where eh.version = 1
      and eh.data @? '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())'
    

    Given your example JSON, this returns:

    [
        {
            "id": 255,
            "name": "festival",
            "values": [
                {
                    "id": 654,
                    "name": "importance",
                    "type": "text",
                    "value": "high"
                },
                {
                    "id": 655,
                    "name": "epic",
                    "type": "boolean",
                    "value": "true"
                }
            ],
            "end_date": "2023-04-15",
            "start_date": "2022-04-15"
        }
    ]
    

    Depending on your data distribution, a GIN index on data or a B-tree index on version could help.
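
    For example (a sketch; whether either index is actually used depends on the planner, and GIN jsonb indexes generally cannot accelerate range comparisons such as the datetime() filter above, so verify with EXPLAIN (ANALYZE) on real data):

    create index events_holder_version_idx on events_holder (version);

    -- jsonb_path_ops is smaller and faster than the default jsonb_ops
    -- operator class, but only supports the @>, @? and @@ operators.
    create index events_holder_data_idx on events_holder using gin (data jsonb_path_ops);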

    If you need to reconstruct the whole JSON content, but with just the filtered events array, you can do something like this:

    select (data - 'events') ||
           jsonb_build_object('events',
             jsonb_path_query_array(data, '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())'))
    from events_holder eh
    ...
    

    (data - 'events') removes the events key from the JSON. Then the result of the JSON path query is appended back to that (partial) object.
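
    Putting both pieces together, the complete query for this question might look like this (simply combining the snippets above):

    select (data - 'events') ||
           jsonb_build_object('events',
             jsonb_path_query_array(data, '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())')) as data
    from events_holder eh
    where eh.version = 1
      and eh.data @? '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())';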