sqlpostgresqldateintervalgenerate-series

Postgresql generate date series (performance)


Using postgresql version > 10, I have come to an issue when generating date series using the built-in generate_series function. In essence, it does not accord for the day of the month correctly.

I have many different frequencies (provided by the user) that need to be calculated between a given start and end date. The start date can be any date and thus any day of the month. This generates issues when having frequencies such as monthly combined with a start date of 2018-01-31 or 2018-01-30 as shown in the output below.

I created a solution and wanted to post this here for others to use as I could not find any other solution.

However, after some tests I have seen that my solution has a different performance compared to the built-in generate_series when used on (absurdly) large date ranges. Does anyone have an insight as to how this can be improved?

TL;DR: if possible avoid loops as they are a performance hit, scroll to bottom for improved implementation.

Built-in Output

select generate_series(date '2018-01-31', 
                       date '2018-05-31', 
                       interval '1 month')::date
as frequency;

generates:

 frequency
------------
 2018-01-31
 2018-02-28
 2018-03-28
 2018-04-28
 2018-05-28

As can be seen from the output, the day of the month is not respected and truncated to the minimum day encountered along the way, in this case: 28 due to the month of februari.

Expected Output

As a result of this issue I created a custom function:

create or replace function generate_date_series(
  starts_on date, 
  ends_on date, 
  frequency interval)
returns setof date as $$
declare
  interval_on date := starts_on;
  count int := 1;
begin
  while interval_on <= ends_on loop
    return next interval_on;
    interval_on := starts_on + (count * frequency);
    count := count + 1;
  end loop;
  return;
end;
$$ language plpgsql immutable;

select generate_date_series(date '2018-01-31', 
                            date '2018-05-31', 
                            interval '1 month')
as frequency;

generates:

 frequency
------------
 2018-01-31
 2018-02-28
 2018-03-31
 2018-04-30
 2018-05-31

Performance comparison

No matter what date range is provided, the built-in generate_series has a performance of 2ms on average for:

select generate_series(date '1900-01-01', 
                       date '10000-5-31', 
                       interval '1 month')::date 
as frequency;

while the custom function generate_date_series has a performance of 120ms on average for:

select generate_date_series(date '1900-01-01', 
                            date '10000-5-31', 
                            interval '1 month')::date 
as frequency;

Question

In reality, such ranges will never occur and thus it is a non-issue. For most queries the custom generate_date_series will attain the same performance. Although, I do wonder what causes the difference.

Is there a reason why the built-in function is able to attain a constant performance of 2ms on average no matter what range is provided?

Is there a better way to implement generate_date_series that performs as well as the built-in generate_series?

Improved implementation without loops

(derived from the answer of @eurotrash)

create or replace function generate_date_series(
  starts_on date, 
  ends_on date, 
  frequency interval)
returns setof date as $$
select (starts_on + (frequency * count))::date
from (
  select (row_number() over ()) - 1 as count
  from generate_series(starts_on, ends_on, frequency)
) series
$$ language sql immutable;

with the improved implementation, the generate_date_series function has a performance of 45ms on average for:

select generate_date_series(date '1900-01-01', 
                            date '10000-5-31', 
                            interval '1 month')::date 
as frequency;

The implementation provided by @eurotrash gives me 80ms on average, which I assume is due to calling the generate_series function twice.


Solution

  • Why your function is slow: you use variables and (more importantly) a loop. Loops are slow. Variables also mean reads from and writes to those variables.

    CREATE OR REPLACE FUNCTION generate_date_series_2(starts_on DATE, ends_on DATE, frequency INTERVAL)
            RETURNS SETOF DATE AS
    $BODY$
            SELECT (starts_on + (frequency * g))::DATE
            FROM generate_series(0, (SELECT COUNT(*)::INTEGER - 1 FROM generate_series(starts_on, ends_on, frequency))) g;
    $BODY$
            LANGUAGE SQL IMMUTABLE;
    

    The concept is basically the same as your plpgsql function but via a single query instead of a loop. The only problem is deciding how many iterations are needed (i.e. the second parameter to generate_series). Sadly I couldn't think of a better way to get the number of intervals required other than calling generate_series for the dates and using the count of that. Of course if you know your intervals will only ever be certain values then it may be possible to optimise; however this version handles any interval values.

    On my system it's about 50% slower than a pure generate_series, and about 400% faster than your plpgsql version.