I'm working with event sourced data where all of the important fields are combined to a JSONB column and many of the keys are missing from most of the database rows.
I would want to get:
I tried this on my own and I was able to produce a example which generates exactly what I would want to achieve here but it looks rather ugly and I'm wondering how to make following query better.
Schema (PostgreSQL v15)
CREATE TABLE events (
id SERIAL PRIMARY KEY,
identifier VARCHAR(255),
timestamp TIMESTAMP WITH TIME ZONE,
event_data JSONB
);
INSERT INTO events (identifier, timestamp, event_data)
VALUES
('12345', '2019-01-01T00:00:00.000Z', '{"target": "99999999"}'),
('12345', '2019-01-01T12:00:00.000Z', '{"ingredients": ["Banana", "Strawberry"]}'),
('12345', '2019-01-03T00:00:00.000Z', '{"target": "12345678", "user": "peterpan"}'),
('12345', '2019-01-04T00:00:00.000Z', '{"ingredients": ["Melon"], "user": "robinhood"}'),
('67890', '2019-01-03T00:00:00.000Z', '{"target": "0000", "user": "mickeymouse"}'),
('67890', '2019-01-04T00:00:00.000Z', '{"ingredients": ["Potato"]}');
Query #1
WITH events_flattened AS (
SELECT
identifier,
timestamp,
event_data->>'target' AS target,
event_data->>'user' AS user,
elem.part
FROM events
LEFT JOIN LATERAL jsonb_array_elements(event_data->'ingredients') elem(part) ON true
ORDER BY timestamp DESC
)
SELECT
identifier,
(ARRAY_REMOVE(ARRAY_AGG(e.target),NULL))[1] as target,
(ARRAY_REMOVE(ARRAY_AGG(e.user),NULL))[1] as user,
ARRAY_REMOVE(ARRAY_AGG(part),NULL) as ingredients,
MAX(timestamp) as latest_update
FROM events_flattened e
GROUP BY identifier;
For the answer to be helpful it should produce exactly same result as seen on this table:
identifier | target | user | ingredients | latest_update |
---|---|---|---|---|
12345 | 12345678 | robinhood | Melon,Strawberry,Banana | 2019-01-04T00:00:00.000Z |
67890 | 0000 | mickeymouse | Potato | 2019-01-04T00:00:00.000Z |
I'm trying to determine which kind of query and indexes would be beneficial to get exactly this kind of data out from this table?
Here's the query I ended up using here. Using the subqueries and jsonb_array_elements
was something my acquaintance Tommi Vainikainen came up with:
SELECT
e.identifier,
(
SELECT e2.event_data->>'target'
FROM events e2
WHERE e2.identifier=e.identifier
AND e2.event_data->>'target' IS NOT NULL
ORDER by timestamp DESC LIMIT 1
) AS target,
(
SELECT e2.event_data->>'user'
FROM events e2
WHERE e2.identifier=e.identifier
AND e2.event_data->>'user' IS NOT NULL
ORDER by timestamp DESC LIMIT 1
) AS user,
ARRAY(
SELECT jsonb_array_elements(e2.event_data->'ingredients')
FROM events e2
WHERE e2.identifier=e.identifier
AND e2.event_data->>'ingredients' IS NOT NULL
) AS ingredients,
MAX(timestamp) AS latest_update
FROM events e
GROUP BY identifier;