I have a table containing events and timestamps
ts | event | name |
---|---|---|
1650000000 | everything is ok | process_a |
1650003700 | something is broken! | process_a |
1650007100 | everything is ok | process_a |
1650010000 | everything is ok | process_b |
1650013100 | something is broken! | process_b |
1650017400 | everything is ok | process_b |
I want to calculate, for every hour, what % of the hour "everything is ok" and what % of the hour "something is broken!", for each process. The final results should be something like
hour | name | ok_perc |
---|---|---|
... | ... | ... |
2022-04-22 7:00:00 | process_a | .912 |
2022-04-22 8:00:00 | process_a | .634 |
2022-04-22 9:00:00 | process_a | 1 |
2022-04-22 1:00:00 | process_b | .354 |
2022-04-22 2:00:00 | process_b | .533 |
2022-04-22 3:00:00 | process_b | .987 |
... | ... | ... |
I have worked on various subqueries to help me reach what I am trying to. The first problem I realised I was going to have is that I won't have all the hours, but only hours when there is a change of state. So I created:
select
timestamp_column
from
(values
(sequence(cast('2022-01-01' as timestamp), --don't bother that this doesn't match my pseudo timestamps in the events table
cast(now() as timestamp),
interval '1' hour
)
)
) as t(timestamp_array)
cross join
unnest(timestamp_array) as t1(timestamp_column)
The above gives me all the hour intervals where the events are going to happen.
Then I left joined the hourly timestamps with my events table on a date_trunc
with the from_unixtime(ts)
, so if there is an event in that hour I have values else I have nulls. Note that there can be multiple events in an hour.
For some reason I thought that persisting the last event to the next hour, if the event in the next hour was null, was a good idea
coalesce
(
event,
lag(event) ignore nulls over
(
partition by name
order by timestamp_column
)
)
And I also think something else that was going to help me figure this out was going to calculate the duration of each event, in this way (take the previous event timestamp minus the current event timestamp):
lead(ts) over
(
partition by
name
order by ts
) - ts as seconds_in_state
For some reason I thought I'd get to a point where I would know, in a given hour, how many seconds I was in one state and how many seconds I was in another. But myseconds_in_state
are sometimes longer than an hour, which tells me I am not going in the right direction.
In general this seems a pretty common issue: how do I summarise an events table by a certain time interval, were each event has an implicit state of a thing name
?
Somehow I got stuck on this and I am stubborn to fix it in Presto SQL rather than downloading the events data and do some manipulation in Python - it should definitely be possible!
Here is a possible approach - generate an hourly intervals between current and next timestamp for partition by name, then flatten the generated array with unnest
and use group by
name and hour to perform needed calculation:
-- sample data
WITH dataset (ts, event, name) AS (
VALUES (1650000000, 'everything is ok', 'process_a'),
(1650003700, 'something is broken!', 'process_a'),
(1650007100, 'everything is ok', 'process_a'),
(1650010000, 'everything is ok', 'process_b'),
(1650013100, 'something is broken!', 'process_b'),
(1650017400, 'everything is ok','process_b')
)
-- query
select name,
ts_hour_exp hour,
count_if(is_ok) * 1.0 / count(*) ok_perc
from (
select date_trunc('hour', from_unixtime(ts)) ts_hour,
if(event = 'everything is ok', true, false) is_ok, -- reduced strings to boolean flag
lead(date_trunc('hour', from_unixtime(ts))) over(
partition by name
order by ts
) next_ts,
name
from dataset
)
cross join unnest ( -- some magic for interval generation
coalesce(
array_except(
sequence(ts_hour, next_ts, interval '1' hour),
array [ ts_hour, next_ts ] -- exclude borders
),
array [ ] -- in case of null
) || ts_hour -- attach current hour
) as t(ts_hour_exp)
group by name, ts_hour_exp
order by name, ts_hour_exp
Output:
name | hour | ok_perc |
---|---|---|
process_a | 2022-04-15 05:00:00.000 | 1.0 |
process_a | 2022-04-15 06:00:00.000 | 0.0 |
process_a | 2022-04-15 07:00:00.000 | 1.0 |
process_b | 2022-04-15 08:00:00.000 | 0.5 |
process_b | 2022-04-15 09:00:00.000 | 0.0 |
process_b | 2022-04-15 10:00:00.000 | 1.0 |