sqlpostgresqlevent-sourcing

How to get last non null values and aggregated non null values by timestamp from event sourcing JSONB column in PostgreSQL?


I'm working with event sourced data where all of the important fields are combined to a JSONB column and many of the keys are missing from most of the database rows.

I would want to get:

  1. Aggregated combined values of included arrays in the JSONB field (see ingredients in the example)
  2. Latest non-null value according to timestamp

I tried this on my own and I was able to produce a example which generates exactly what I would want to achieve here but it looks rather ugly and I'm wondering how to make following query better.

Schema (PostgreSQL v15)

CREATE TABLE events (
    id SERIAL PRIMARY KEY,
    identifier VARCHAR(255),
    timestamp TIMESTAMP WITH TIME ZONE,
    event_data JSONB
);
INSERT INTO events (identifier, timestamp, event_data)
VALUES
    ('12345', '2019-01-01T00:00:00.000Z', '{"target": "99999999"}'),
    ('12345', '2019-01-01T12:00:00.000Z', '{"ingredients": ["Banana", "Strawberry"]}'),
    ('12345', '2019-01-03T00:00:00.000Z', '{"target": "12345678", "user": "peterpan"}'),
    ('12345', '2019-01-04T00:00:00.000Z', '{"ingredients": ["Melon"], "user": "robinhood"}'),
    ('67890', '2019-01-03T00:00:00.000Z', '{"target": "0000", "user": "mickeymouse"}'),
    ('67890', '2019-01-04T00:00:00.000Z', '{"ingredients": ["Potato"]}');

Query #1

WITH events_flattened AS (
    SELECT
        identifier,
        timestamp,
        event_data->>'target' AS target,
        event_data->>'user' AS user,
        elem.part
    FROM events
    LEFT JOIN LATERAL jsonb_array_elements(event_data->'ingredients') elem(part) ON true
    ORDER BY timestamp DESC
)

SELECT
    identifier,
    (ARRAY_REMOVE(ARRAY_AGG(e.target),NULL))[1] as target,
    (ARRAY_REMOVE(ARRAY_AGG(e.user),NULL))[1] as user,
    ARRAY_REMOVE(ARRAY_AGG(part),NULL) as ingredients,
    MAX(timestamp) as latest_update
FROM events_flattened e
GROUP BY identifier;

For the answer to be helpful it should produce exactly same result as seen on this table:

identifier target user ingredients latest_update
12345 12345678 robinhood Melon,Strawberry,Banana 2019-01-04T00:00:00.000Z
67890 0000 mickeymouse Potato 2019-01-04T00:00:00.000Z

View on DB Fiddle

I'm trying to determine which kind of query and indexes would be beneficial to get exactly this kind of data out from this table?


Solution

  • Here's the query I ended up using here. Using the subqueries and jsonb_array_elements was something my acquaintance Tommi Vainikainen came up with:

    
    SELECT
      e.identifier,
      (
        SELECT e2.event_data->>'target'
        FROM events e2
        WHERE e2.identifier=e.identifier
        AND e2.event_data->>'target' IS NOT NULL
        ORDER by timestamp DESC LIMIT 1
      ) AS target,
      (
        SELECT e2.event_data->>'user'
        FROM events e2
        WHERE e2.identifier=e.identifier
        AND e2.event_data->>'user' IS NOT NULL
        ORDER by timestamp DESC LIMIT 1
      ) AS user,
      ARRAY(
        SELECT jsonb_array_elements(e2.event_data->'ingredients')
        FROM events e2
        WHERE e2.identifier=e.identifier
        AND e2.event_data->>'ingredients' IS NOT NULL
      ) AS ingredients,
      MAX(timestamp) AS latest_update
    FROM events e
    GROUP BY identifier;