Tags: sql, postgresql, time-series, window-functions, gaps-and-islands

PostgreSQL: backfill data if none exists for the start period


Let's assume I have a table with 3 columns: Id, Value, Time. I need to find the min, max, first, and last value in each 1-minute window between a provided start and end time. If no row exists at the start time, I need to backfill it with the last value before the start; likewise, any empty window should reuse the last value.

e.g. Let's assume a table, simplifying to just include time:


    Id Time    Value
    1 10:10:05 3
    1 10:11:06 4
    1 10:13:13 5
    1 10:13:19 9
    1 10:13:32 8
    1 10:14:35 2 

Now suppose I want the results from start time 10:12:00 to end time 10:14:00.

For the 12-13 minute window, it should give the following information:

start value = 4 (backfilled from the last value in the data preceding the window)
end value = 4
min value = 4
max value = 4
minute start = 10:12:00
minute end = 10:13:00

For the 13-14 minute window:

start value = 4 (backfilled from the last value in the data preceding the window)
end value = 8
min value = 4
max value = 9
minute start = 10:13:00
minute end = 10:14:00

Basically, any empty window should use the previous window's last value.
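To make the expected output precise, here is the same rule sketched in plain Python (minute_windows is an illustrative helper I wrote for this question, not an existing API; the dates are made up since only the times matter):

```python
from datetime import datetime, timedelta

# Sample rows from the question: (time, value)
rows = [
    (datetime(2024, 1, 1, 10, 10, 5), 3),
    (datetime(2024, 1, 1, 10, 11, 6), 4),
    (datetime(2024, 1, 1, 10, 13, 13), 5),
    (datetime(2024, 1, 1, 10, 13, 19), 9),
    (datetime(2024, 1, 1, 10, 13, 32), 8),
    (datetime(2024, 1, 1, 10, 14, 35), 2),
]

def minute_windows(rows, start, end):
    """One tuple per 1-minute window in [start, end):
    (win_start, win_end, start_value, end_value, min_value, max_value).
    Empty windows are backfilled with the last value carried forward;
    the backfilled start value also counts toward min/max."""
    rows = sorted(rows)
    carry = None
    for t, v in rows:          # last value strictly before `start`
        if t < start:
            carry = v
    out = []
    win = start
    while win < end:
        nxt = win + timedelta(minutes=1)
        vals = [v for t, v in rows if win <= t < nxt]
        if vals:
            end_val = vals[-1]
            pool = vals + ([carry] if carry is not None else [])
            mn, mx = min(pool), max(pool)
        else:                  # empty window: everything is the carried value
            end_val = mn = mx = carry
        out.append((win.time(), nxt.time(), carry, end_val, mn, mx))
        carry = end_val
        win = nxt
    return out
```

Running it for 10:12:00 to 10:14:00 yields the two windows described above.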

Is there a query that can do this? If that query is too complex, can we at least have a query that simply gives the values between the provided start and end time, except that if no row exists at the start time, it also gives the last value before the start time? Basically, I always want a value at the start time.

So for example, if I ask for values between the start and end times of 10:12:00 and 10:14:00, it will give the following values:


    1 10:12:00 4 (back fill from last value)
    1 10:13:13 5
    1 10:13:19 9
    1 10:13:32 8

The rest I'll do programmatically.
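For reference, this simpler fallback can also be sketched procedurally (rows_with_backfilled_start is a hypothetical helper, not an existing API):

```python
from datetime import datetime

def rows_with_backfilled_start(rows, start, end):
    """Rows with time in [start, end); if no row sits exactly at `start`,
    prepend a synthetic row at `start` carrying the last value before it."""
    rows = sorted(rows)
    in_range = [(t, v) for t, v in rows if start <= t < end]
    if not in_range or in_range[0][0] != start:
        prior = [v for t, v in rows if t < start]
        if prior:
            in_range.insert(0, (start, prior[-1]))
    return in_range

# Sample rows from the question (dates are made up; only the times matter)
rows = [
    (datetime(2024, 1, 1, 10, 10, 5), 3),
    (datetime(2024, 1, 1, 10, 11, 6), 4),
    (datetime(2024, 1, 1, 10, 13, 13), 5),
    (datetime(2024, 1, 1, 10, 13, 19), 9),
    (datetime(2024, 1, 1, 10, 13, 32), 8),
    (datetime(2024, 1, 1, 10, 14, 35), 2),
]
```

For 10:12:00 to 10:14:00 this returns the four rows listed above, the first being the synthetic backfilled one.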


Solution

  • Demo at db<>fiddle:

    with param(target_range) as (values ('[10:12:00,10:14:00]'::trange))
    -- trange is assumed to be a custom range type over time, e.g.:
    --   create type trange as range (subtype = time);
    ,add_missing as (
        select date_trunc('minute',g-'1 minute'::interval)::time as start,
               date_trunc('minute',g)::time as end,
               -- last value within the slot, in "Time" order
               (array_agg("Value" order by "Time"))[count(*)] end_value,
               min("Value"),
               max("Value")
        from param
        -- one slot per minute, plus one extra leading slot to backfill from
        cross join generate_series(current_date + lower(target_range),
                                   current_date + upper(target_range),
                                   '1 minute')g
        left join your_table t on date_trunc('minute',g-'1 minute'::interval)::time
                                 =date_trunc('minute',t."Time")
                              and g::time <@ target_range
        group by 1,2)
    select "start","end"
          -- start value: last non-null end_value of any earlier slot (w0)
          ,((jsonb_agg(end_value)filter(where end_value is not null)over w0)->>-1
           )::int as start_value
          -- end value: last non-null end_value up to and including this slot (w1)
          ,((jsonb_agg(end_value)filter(where end_value is not null)over w1)->>-1
           )::int as end_value
          -- min/max: fold the backfilled start value into the slot's min/max
          ,least(((jsonb_agg(end_value)filter(where end_value is not null)over w0)->>-1
                 )::int,min) as min
          ,greatest(((jsonb_agg(end_value)filter(where end_value is not null)over w0)->>-1
                    )::int,max) as max
    from add_missing
    window w1 as (order by start)
          ,w0 as (order by start rows between unbounded preceding and 1 preceding)
    offset 1;

    start    end      start_value end_value min max
    10:12:00 10:13:00 4           4         4   4
    10:13:00 10:14:00 4           8         4   9
1. There's a gaps-and-islands aspect to your problem.
2. You can use generate_series() to generate 1-minute slots in your range, then left join on the matching minutes.
3. The left join produces all-null rows wherever you had nothing in that window.
4. Normally, you'd use coalesce(val, lag(val) over w1) to get the "previous" value in those all-null rows. The problem is that PostgreSQL doesn't support the IGNORE NULLS clause, and the previous minute you'd want values from could itself also be empty. To address that, I aggregate all earlier values into an array, filter to skip nulls, and take the last element.
5. Native arrays don't support negative subscripts, which means the last element can't be taken as arr[-1]. You'd either have to build the arrays in an intermediate CTE or subquery and then reference them as arr[array_upper(arr,1)], or use jsonb arrays instead: jsonb arrays do support negative subscripts, but you have to cast their resulting text type back to your original int. With native arrays you can also do (array_agg() over w1)[count(*) over w1], similar to (array_agg())[count(*)].
6. I took your time-looking values literally, which is why I used a time-based range type. If that's just an example, this can be changed to use a proper timestamptz and tstzrange.
7. OFFSET 1 at the end skips the additional, initial row that you need in order to backfill from it, but don't want in your final output, since it precedes your target range.
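The carry-forward trick in points 4 and 5 (aggregate all earlier values, skip nulls, take the last element) can be illustrated outside SQL too. A minimal Python analogue of jsonb_agg(...) filter (where ... is not null) ->> -1, with last_non_null_prefix being a name I made up for this sketch:

```python
def last_non_null_prefix(values):
    """For each position, return the most recent non-None value at or
    before it: the same effect as aggregating all earlier values,
    filtering out nulls, and taking the last array element."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out
```

For example, per-minute end_values with empty slots as None, like [4, None, 8], come back as [4, 4, 8]: each gap is filled with the last real value, even across several consecutive gaps, which is exactly what lag() alone cannot do.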