
PrestoSQL: transform event data into an hourly summary


I have a table containing events and timestamps

    ts          event                 name
    1650000000  everything is ok      process_a
    1650003700  something is broken!  process_a
    1650007100  everything is ok      process_a
    1650010000  everything is ok      process_b
    1650013100  something is broken!  process_b
    1650017400  everything is ok      process_b

I want to calculate, for every hour and for each process, what percentage of the hour the state was "everything is ok" and what percentage it was "something is broken!". The final result should look something like:

    hour                name       ok_perc
    ...                 ...        ...
    2022-04-22 7:00:00  process_a  .912
    2022-04-22 8:00:00  process_a  .634
    2022-04-22 9:00:00  process_a  1
    2022-04-22 1:00:00  process_b  .354
    2022-04-22 2:00:00  process_b  .533
    2022-04-22 3:00:00  process_b  .987
    ...                 ...        ...

I have worked on various subqueries to get closer to this. The first problem I realised I was going to have is that I won't have all the hours, only the hours in which there is a change of state. So I created:

    select 
        timestamp_column
    from
        (values 
            (sequence(cast('2022-01-01' as timestamp),  --don't bother that this doesn't match my pseudo timestamps in the events table
                      cast(now() as timestamp),
                      interval '1' hour
                     )
            )
        ) as t(timestamp_array)
    cross join 
        unnest(timestamp_array) as t1(timestamp_column)

The above gives me all the hour intervals where the events are going to happen.

Then I left joined the hourly timestamps with my events table on date_trunc('hour', from_unixtime(ts)), so if there is an event in that hour I get values, otherwise I get nulls. Note that there can be multiple events in an hour.
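The join described above can be sketched roughly as follows. This is only an illustration: the `events` CTE is a hypothetical stand-in for the real table, the hour range is shortened to a few hours around the sample data, and the matching hours assume a UTC session time zone.

```sql
-- Hypothetical inline copy of a few event rows (stand-in for the real table)
WITH events (ts, event, name) AS (
    VALUES (1650000000, 'everything is ok', 'process_a'),
           (1650003700, 'something is broken!', 'process_a'),
           (1650007100, 'everything is ok', 'process_a')
),
-- Hour spine, same technique as the query above but with a short fixed range
hours AS (
    SELECT timestamp_column
    FROM (
        VALUES (sequence(timestamp '2022-04-15 05:00:00',
                         timestamp '2022-04-15 08:00:00',
                         interval '1' hour))
    ) AS t(timestamp_array)
    CROSS JOIN UNNEST(timestamp_array) AS t1(timestamp_column)
)
SELECT h.timestamp_column,
       e.name,
       e.event
FROM hours h
LEFT JOIN events e
    -- an event belongs to the hour its timestamp truncates to
    ON date_trunc('hour', from_unixtime(e.ts)) = h.timestamp_column
ORDER BY h.timestamp_column
```

Hours without any event come back with NULL in `name` and `event`, and an hour with several events produces one row per event.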

For some reason I thought that carrying the last event forward into the next hour, when the next hour had no event of its own, was a good idea:

    coalesce(
        event,
        lag(event) ignore nulls over (
            partition by name
            order by timestamp_column
        )
    )

I also thought it would help to calculate the duration of each event, like this (take the next event's timestamp minus the current event's timestamp):

        lead(ts) over 
                (
                    partition by 
                        name
                    order by ts
                ) - ts as seconds_in_state

For some reason I thought I'd get to a point where I would know, in a given hour, how many seconds I was in one state and how many seconds I was in another. But my seconds_in_state values are sometimes longer than an hour, which tells me I am not going in the right direction.
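To make the problem concrete, here is a minimal sketch that applies the same `lead(ts)` expression to an inline copy of the sample rows (the `events` CTE name is only a stand-in for the real table):

```sql
-- Hypothetical inline copy of the sample rows from the question
WITH events (ts, event, name) AS (
    VALUES (1650000000, 'everything is ok', 'process_a'),
           (1650003700, 'something is broken!', 'process_a'),
           (1650007100, 'everything is ok', 'process_a'),
           (1650010000, 'everything is ok', 'process_b'),
           (1650013100, 'something is broken!', 'process_b'),
           (1650017400, 'everything is ok', 'process_b')
)
SELECT name,
       event,
       ts,
       -- next event's timestamp minus this one; NULL for the last event per name
       lead(ts) OVER (PARTITION BY name ORDER BY ts) - ts AS seconds_in_state
FROM events
ORDER BY name, ts
```

With these six rows the differences are 3700, 3400 and NULL for process_a, and 3100, 4300 and NULL for process_b. Two of the four known durations already exceed 3600 seconds, and even the shorter ones can straddle an hour boundary, so a duration on its own does not say how many seconds fall into which hour.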

In general this seems a pretty common problem: how do I summarise an events table by a certain time interval, where each event implies the state of a named thing until the next event arrives?

Somehow I got stuck on this, and I am stubbornly set on solving it in Presto SQL rather than downloading the events data and manipulating it in Python. It should definitely be possible!


Solution

  • Here is a possible approach: for each name, generate the hourly intervals between the current and the next timestamp, then flatten the generated array with unnest and group by name and hour to perform the needed calculation:

    -- sample data
    WITH dataset (ts, event, name) AS (
        VALUES (1650000000, 'everything is ok', 'process_a'),
          (1650003700, 'something is broken!', 'process_a'),
          (1650007100,  'everything is ok', 'process_a'),
          (1650010000,  'everything is ok', 'process_b'),
          (1650013100,  'something is broken!', 'process_b'),
          (1650017400,  'everything is ok','process_b')
    ) 
    
    -- query
    select name, 
        ts_hour_exp hour, 
        count_if(is_ok) * 1.0 / count(*) ok_perc
    from (
            select date_trunc('hour', from_unixtime(ts)) ts_hour,
                if(event = 'everything is ok', true, false) is_ok, -- reduced strings to boolean flag 
                lead(date_trunc('hour', from_unixtime(ts))) over(
                    partition by name
                    order by ts
                ) next_ts,
                name
            from dataset
        )
    cross join unnest ( -- some magic for interval generation
            coalesce(
                array_except(
                    sequence(ts_hour, next_ts, interval '1' hour),
                    array [ ts_hour, next_ts ] -- exclude borders
                ),
                array [ ] -- in case of null
            ) || ts_hour -- attach current hour
        ) as t(ts_hour_exp)
    group by name, ts_hour_exp
    order by name, ts_hour_exp
    

    Output:

    name       hour                     ok_perc
    process_a  2022-04-15 05:00:00.000  1.0
    process_a  2022-04-15 06:00:00.000  0.0
    process_a  2022-04-15 07:00:00.000  1.0
    process_b  2022-04-15 08:00:00.000  0.5
    process_b  2022-04-15 09:00:00.000  0.0
    process_b  2022-04-15 10:00:00.000  1.0
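
Note that the query above divides a count of state rows by the total number of rows per hour, so ok_perc is not weighted by elapsed time. If a time-weighted share is needed, one possible variation is to clip every state interval to the hours it touches and sum the seconds. The following is a minimal sketch under several assumptions: timestamps strictly increase within each name, the last event of each name has no known end and is therefore dropped, and an hour that is only partly covered is divided by the covered seconds rather than by 3600. The names `states`, `slices`, `state_start`, `state_end` and `hour_start` are illustrative only.

```sql
WITH dataset (ts, event, name) AS (
    VALUES (1650000000, 'everything is ok', 'process_a'),
           (1650003700, 'something is broken!', 'process_a'),
           (1650007100, 'everything is ok', 'process_a'),
           (1650010000, 'everything is ok', 'process_b'),
           (1650013100, 'something is broken!', 'process_b'),
           (1650017400, 'everything is ok', 'process_b')
),
-- one row per state: it lasts from its own event until the next event of the same name
states AS (
    SELECT name,
           event = 'everything is ok' AS is_ok,
           CAST(ts AS bigint) AS state_start,
           CAST(lead(ts) OVER (PARTITION BY name ORDER BY ts) AS bigint) AS state_end
    FROM dataset
),
-- one row per (state, hour touched), with the overlap in seconds
slices AS (
    SELECT name,
           is_ok,
           hour_start,
           least(state_end, hour_start + 3600)
               - greatest(state_start, hour_start) AS seconds_in_state
    FROM states
    CROSS JOIN UNNEST(
        sequence(
            state_start - state_start % 3600,          -- hour containing the start
            (state_end - 1) - (state_end - 1) % 3600,  -- hour containing the last second
            3600
        )
    ) AS t(hour_start)
    WHERE state_end IS NOT NULL  -- open-ended last state is ignored in this sketch
)
SELECT name,
       from_unixtime(hour_start) AS hour,
       CAST(sum(if(is_ok, seconds_in_state, 0)) AS double)
           / sum(seconds_in_state) AS ok_perc
FROM slices
GROUP BY name, hour_start
ORDER BY name, hour_start
```

On the sample data, the process_a hour starting at epoch 1650002400 collects 1300 seconds of "everything is ok" and 2300 seconds of "something is broken!", which gives roughly 0.36. Working in epoch seconds keeps the hour arithmetic in plain integers; the cast to double avoids the one-decimal result that `* 1.0` would produce with decimal division.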