postgresqlsimulationgenerate-series

PostgreSQL - GENERATE_SERIES issue - simulating data


I thought I understood GENERATE_SERIES(), but obviously not as well as I thought.

I have a table:

CREATE TABLE status
(
  start_tz  TIMESTAMPTZ NOT NULL,
  end_tz    TIMESTAMPTZ NOT NULL,
  ms        TEXT        NOT NULL,
  sensor_id SMALLINT    NOT NULL,
);

And I wish to fill this table with simulated data from 2022-01-01 00:00:00 in 20 minute intervals for 4 sensors (numbered 1 - 4) for 2 hours (or whatever - no hardcoding). The ms text is random - see below.

I have tried various combinations of

SELECT
  GENERATE_SERIES('2022-01-01 00:00:00', '2022-01-01 02:00:00', (INTERVAL '20 MINUTES')),
  GENERATE_SERIES('2022-01-01 00:20:00', '2022-01-01 02:20:00', (INTERVAL '20 MINUTES')),
  CASE
     WHEN random() > 0.5 THEN 'in_motion)'
    ELSE                     'stationary'
  END,
  GENERATE_SERIES(1, 4);

and just can't get it. I also have various cominations and permutations of FROM GENERATE_SERIES(1, ... whatever guess...); and I'm stumped.

What I want looks like this:

start_tz                 end_tz                 ms      sensor_id
2022-01-01 00:00:00+00  2022-01-01 00:20:00+00  stationary  1
2022-01-01 00:20:00+00  2022-01-01 00:40:00+00  stationary  1
2022-01-01 00:40:00+00  2022-01-01 01:00:00+00  in_motion)  1
2022-01-01 01:00:00+00  2022-01-01 01:20:00+00  in_motion)  1
2022-01-01 01:20:00+00  2022-01-01 01:40:00+00  stationary  1
2022-01-01 01:40:00+00  2022-01-01 02:00:00+00  in_motion)  1
2022-01-01 02:00:00+00  2022-01-01 02:20:00+00  stationary  1
2022-01-01 00:00:00+00  2022-01-01 00:20:00+00  stationary  2
2022-01-01 00:20:00+00  2022-01-01 00:40:00+00  stationary  2
2022-01-01 00:40:00+00  2022-01-01 01:00:00+00  in_motion)  2
...
...
and so on

A fiddle is available here. The correct answer will give some explanation of where I was going wrong - I don't want to be given a fish, I would like to be taught how to do a little fishing of my own!


Solution

  • The behavior you want would have been produced in 9.6, but perhaps only by accident. Back then multiple SRF in the select list would be cycled until all of them were exhausted at the same time, returning a number of rows which was the Least Common Multiple of each isolated row counts. Because 7 is prime, this would do what you apparently want. This was changed in v10, as a known side effect of other refactoring I think, so that each SRF was cycle just once with the "short" ones being augmented with NULLs.

    To get what you want you can put one of the SRF into the FROM list:

    SELECT
      GENERATE_SERIES('2022-01-01 00:00:00', '2022-01-01 02:00:00', (INTERVAL '20 MINUTES')),
      GENERATE_SERIES('2022-01-01 00:20:00', '2022-01-01 02:20:00', (INTERVAL '20 MINUTES')),
      CASE
         WHEN random() > 0.5 THEN 'in_motion)'
        ELSE                     'stationary'
      END, 
      f 
      FROM 
      GENERATE_SERIES(1, 4) f(f);
    

    But I think it is still discouraged to put SRF in the SELECT-list. The problem is that putting both of the time ones in the FROM-list would return 7*7*4 rows, not LCM(7,7,4) rows. To get around that, you would reduce it to one SRF for the time, and use an expression to get the parallel offset time point.

    SELECT
          s,
          s+INTERVAL '20 MINUTES',
          CASE
             WHEN random() > 0.5 THEN 'in_motion)'
            ELSE                     'stationary'
          END, 
          f 
          FROM GENERATE_SERIES('2022-01-01 00:00:00', '2022-01-01 02:00:00', (INTERVAL '20 MINUTES')) s(s), 
          GENERATE_SERIES(1, 4) f(f);
    

    So this is the way I would do it. It is also less repetitive, repeating only the interval rather than the offsat start, offsat stop, and interval.