PrestoSQL transform event data into hourly summary

I have a table containing events and timestamps

ts	event	name
1650000000	everything is ok	process_a
1650003700	something is broken!	process_a
1650007100	everything is ok	process_a
1650010000	everything is ok	process_b
1650013100	something is broken!	process_b
1650017400	everything is ok	process_b

I want to calculate, for every hour, what % of the hour "everything is ok" and what % of the hour "something is broken!", for each process. The final results should be something like

hour	name	ok_perc
...	...	...
2022-04-22 7:00:00	process_a	.912
2022-04-22 8:00:00	process_a	.634
2022-04-22 9:00:00	process_a	1
2022-04-22 1:00:00	process_b	.354
2022-04-22 2:00:00	process_b	.533
2022-04-22 3:00:00	process_b	.987
...	...	...

I have worked on various subqueries to help me reach what I am trying to. The first problem I realised I was going to have is that I won't have all the hours, but only hours when there is a change of state. So I created:

    select 
        timestamp_column
    from
        (values 
            (sequence(cast('2022-01-01' as timestamp),  --don't bother that this doesn't match my pseudo timestamps in the events table
                      cast(now() as timestamp),
                      interval '1' hour
                     )
            )
        ) as t(timestamp_array)
    cross join 
        unnest(timestamp_array) as t1(timestamp_column)

The above gives me all the hour intervals where the events are going to happen.

Then I left joined the hourly timestamps with my events table on a date_trunc with the from_unixtime(ts), so if there is an event in that hour I have values else I have nulls. Note that there can be multiple events in an hour.

For some reason I thought that persisting the last event to the next hour, if the event in the next hour was null, was a good idea

coalesce
            (
                event,
                lag(event) ignore nulls over 
                    (
                        partition by name
                        order by timestamp_column
                    )
            )

And I also think something else that was going to help me figure this out was going to calculate the duration of each event, in this way (take the previous event timestamp minus the current event timestamp):

        lead(ts) over 
                (
                    partition by 
                        name
                    order by ts
                ) - ts as seconds_in_state

For some reason I thought I'd get to a point where I would know, in a given hour, how many seconds I was in one state and how many seconds I was in another. But myseconds_in_state are sometimes longer than an hour, which tells me I am not going in the right direction.

In general this seems a pretty common issue: how do I summarise an events table by a certain time interval, were each event has an implicit state of a thing name?

Somehow I got stuck on this and I am stubborn to fix it in Presto SQL rather than downloading the events data and do some manipulation in Python - it should definitely be possible!

Solution

Here is a possible approach - generate an hourly intervals between current and next timestamp for partition by name, then flatten the generated array with unnest and use group by name and hour to perform needed calculation:

-- sample data
WITH dataset (ts, event, name) AS (
    VALUES (1650000000, 'everything is ok', 'process_a'),
      (1650003700, 'something is broken!', 'process_a'),
      (1650007100,  'everything is ok', 'process_a'),
      (1650010000,  'everything is ok', 'process_b'),
      (1650013100,  'something is broken!', 'process_b'),
      (1650017400,  'everything is ok','process_b')
) 

-- query
select name, 
    ts_hour_exp hour, 
    count_if(is_ok) * 1.0 / count(*) ok_perc
from (
        select date_trunc('hour', from_unixtime(ts)) ts_hour,
            if(event = 'everything is ok', true, false) is_ok, -- reduced strings to boolean flag 
            lead(date_trunc('hour', from_unixtime(ts))) over(
                partition by name
                order by ts
            ) next_ts,
            name
        from dataset
    )
cross join unnest ( -- some magic for interval generation
        coalesce(
            array_except(
                sequence(ts_hour, next_ts, interval '1' hour),
                array [ ts_hour, next_ts ] -- exclude borders
            ),
            array [ ] -- in case of null
        ) || ts_hour -- attach current hour
    ) as t(ts_hour_exp)
group by name, ts_hour_exp
order by name, ts_hour_exp

Output:

name	hour	ok_perc
process_a	2022-04-15 05:00:00.000	1.0
process_a	2022-04-15 06:00:00.000	0.0
process_a	2022-04-15 07:00:00.000	1.0
process_b	2022-04-15 08:00:00.000	0.5
process_b	2022-04-15 09:00:00.000	0.0
process_b	2022-04-15 10:00:00.000	1.0