Tags: postgresql, jsonb, postgresql-json

How can I select a filtered PostgreSQL jsonb field with performance as a priority?


A table:

CREATE TABLE events_holder(
    id serial primary key,
    version int not null,
    data jsonb not null
);

The data field can be very large (up to 100 MB) and looks like this:

{
  "id": 5,
  "name": "name5",
  "events": [
    {
      "id": 255,
      "name": "festival",
      "start_date": "2022-04-15",
      "end_date": "2023-04-15",
      "values": [
        {
          "id": 654,
          "type": "text",
          "name": "importance",
          "value": "high"
        },
        {
          "id": 655,
          "type": "boolean",
          "name": "epic",
          "value": "true"
        }
      ]
    },
    {
      "id": 256,
      "name": "discovery",
      "start_date": "2022-02-20",
      "end_date": "2022-02-22",
      "values": [
        {
          "id": 711,
          "type": "text",
          "name": "importance",
          "value": "low"
        },
        {
          "id": 712,
          "type": "boolean",
          "name": "specificAttribute",
          "value": "false"
        }
      ]
    }
  ]
}
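
For reference, a test row can be created like this (the values arrays are abbreviated from the document above):

INSERT INTO events_holder (version, data) VALUES (
  1,
  '{
     "id": 5,
     "name": "name5",
     "events": [
       {"id": 255, "name": "festival",  "start_date": "2022-04-15", "end_date": "2023-04-15",
        "values": [{"id": 654, "type": "text", "name": "importance", "value": "high"}]},
       {"id": 256, "name": "discovery", "start_date": "2022-02-20", "end_date": "2022-02-22",
        "values": [{"id": 711, "type": "text", "name": "importance", "value": "low"}]}
     ]
   }'::jsonb
);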

I want to select the data field by version, but filtered with an extra condition: only events whose end_date > '2022-03-15' should be kept. The output must look like this:

{
  "id": 5,
  "name": "name5",
  "events": [
    {
      "id": 255,
      "name": "festival",
      "start_date": "2022-04-15",
      "end_date": "2023-04-15",
      "values": [
        {
          "id": 654,
          "type": "text",
          "name": "importance",
          "value": "high"
        },
        {
          "id": 655,
          "type": "boolean",
          "name": "epic",
          "value": "true"
        }
      ]
    }
  ]
}

How can I do this with maximum performance? How should I index the data field?

My current solution:

with cte as (
  select eh.id, eh.version, jsonb_agg(events) as filteredEvents
  from events_holder eh
  cross join jsonb_array_elements(eh.data #> '{events}') as events
  where eh.version = 1
    and (events ->> 'end_date')::timestamp >= '2022-03-15'::timestamp
  group by eh.id, eh.version
)
select jsonb_set(eh.data, '{events}', cte.filteredEvents)
from events_holder eh
join cte on eh.id = cte.id;

But I don't think it's a good approach.
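
One caveat I noticed: the cross join drops rows whose events array has no qualifying entries at all. If such rows should still come back with an empty events array, a left join lateral variant (a sketch I haven't tested) would keep them:

-- Keep rows even when no event matches, returning an empty events array for them.
with cte as (
  select eh.id, eh.version,
         coalesce(jsonb_agg(ev) filter (where ev is not null), '[]'::jsonb) as filteredEvents
  from events_holder eh
  left join lateral jsonb_array_elements(eh.data -> 'events') as ev
         on (ev ->> 'end_date')::timestamp >= '2022-03-15'::timestamp
  where eh.version = 1
  group by eh.id, eh.version
)
select jsonb_set(eh.data, '{events}', cte.filteredEvents)
from events_holder eh
join cte on eh.id = cte.id;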


Solution

  • You can do this using a JSON path expression:

    select eh.id, eh.version, 
           jsonb_path_query_array(data, 
                                  '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())')
    from events_holder eh
    where eh.version = 1
      and eh.data @? '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())'
    

    Given your example JSON, this returns:

    [
        {
            "id": 255,
            "name": "festival",
            "values": [
                {
                    "id": 654,
                    "name": "importance",
                    "type": "text",
                    "value": "high"
                },
                {
                    "id": 655,
                    "name": "epic",
                    "type": "boolean",
                    "value": "true"
                }
            ],
            "end_date": "2023-04-15",
            "start_date": "2022-04-15"
        }
    ]
    

    Depending on your data distribution, a GIN index on data or a B-tree index on version could help.
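
    For example (a sketch; whether either index is actually used depends on the planner, and GIN jsonb indexes generally cannot accelerate range comparisons such as the datetime() filter above, so verify with EXPLAIN (ANALYZE) on real data):

    create index events_holder_version_idx on events_holder (version);

    -- jsonb_path_ops is smaller and faster than the default jsonb_ops
    -- operator class, but only supports the @>, @? and @@ operators.
    create index events_holder_data_idx on events_holder using gin (data jsonb_path_ops);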

    If you need to reconstruct the whole JSON content, but with just the filtered events array, you can do something like this:

    select (data - 'events') ||
           jsonb_build_object('events',
             jsonb_path_query_array(data, '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())'))
    from events_holder eh
    ...
    

    (data - 'events') removes the events key from the JSON. Then the result of the JSON path query is appended back to that (partial) object.
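
    Putting both pieces together, the complete query for this question might look like this (simply combining the snippets above):

    select (data - 'events') ||
           jsonb_build_object('events',
             jsonb_path_query_array(data, '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())')) as data
    from events_holder eh
    where eh.version = 1
      and eh.data @? '$.events[*] ? (@.end_date.datetime() >= "2022-03-15".datetime())';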