Using postgresql version > 10, I have come to an issue when generating date series using the built-in generate_series
function. In essence, it does not accord for the day of the month
correctly.
I have many different frequencies (provided by the user) that need to be calculated between a given start and end date. The start date can be any date and thus any day of the month. This generates issues when having frequencies such as monthly
combined with a start date of 2018-01-31
or 2018-01-30
as shown in the output below.
I created a solution and wanted to post this here for others to use as I could not find any other solution.
However, after some tests I have seen that my solution has a different performance compared to the built-in generate_series
when used on (absurdly) large date ranges. Does anyone have an insight as to how this can be improved?
TL;DR: if possible avoid loops as they are a performance hit, scroll to bottom for improved implementation.
select generate_series(date '2018-01-31',
date '2018-05-31',
interval '1 month')::date
as frequency;
generates:
frequency
------------
2018-01-31
2018-02-28
2018-03-28
2018-04-28
2018-05-28
As can be seen from the output, the day of the month is not respected and truncated to the minimum day encountered along the way, in this case: 28 due to the month of februari
.
As a result of this issue I created a custom function:
create or replace function generate_date_series(
starts_on date,
ends_on date,
frequency interval)
returns setof date as $$
declare
interval_on date := starts_on;
count int := 1;
begin
while interval_on <= ends_on loop
return next interval_on;
interval_on := starts_on + (count * frequency);
count := count + 1;
end loop;
return;
end;
$$ language plpgsql immutable;
select generate_date_series(date '2018-01-31',
date '2018-05-31',
interval '1 month')
as frequency;
generates:
frequency
------------
2018-01-31
2018-02-28
2018-03-31
2018-04-30
2018-05-31
No matter what date range is provided, the built-in generate_series
has a performance of 2ms on average for:
select generate_series(date '1900-01-01',
date '10000-5-31',
interval '1 month')::date
as frequency;
while the custom function generate_date_series
has a performance of 120ms on average for:
select generate_date_series(date '1900-01-01',
date '10000-5-31',
interval '1 month')::date
as frequency;
In reality, such ranges will never occur and thus it is a non-issue. For most queries the custom generate_date_series
will attain the same performance. Although, I do wonder what causes the difference.
Is there a reason why the built-in function is able to attain a constant performance of 2ms on average no matter what range is provided?
Is there a better way to implement generate_date_series
that performs as well as the built-in generate_series
?
(derived from the answer of @eurotrash)
create or replace function generate_date_series(
starts_on date,
ends_on date,
frequency interval)
returns setof date as $$
select (starts_on + (frequency * count))::date
from (
select (row_number() over ()) - 1 as count
from generate_series(starts_on, ends_on, frequency)
) series
$$ language sql immutable;
with the improved implementation, the generate_date_series
function has a performance of 45ms on average for:
select generate_date_series(date '1900-01-01',
date '10000-5-31',
interval '1 month')::date
as frequency;
The implementation provided by @eurotrash gives me 80ms on average, which I assume is due to calling the generate_series
function twice.
Why your function is slow: you use variables and (more importantly) a loop. Loops are slow. Variables also mean reads from and writes to those variables.
CREATE OR REPLACE FUNCTION generate_date_series_2(starts_on DATE, ends_on DATE, frequency INTERVAL)
RETURNS SETOF DATE AS
$BODY$
SELECT (starts_on + (frequency * g))::DATE
FROM generate_series(0, (SELECT COUNT(*)::INTEGER - 1 FROM generate_series(starts_on, ends_on, frequency))) g;
$BODY$
LANGUAGE SQL IMMUTABLE;
The concept is basically the same as your plpgsql function but via a single query instead of a loop. The only problem is deciding how many iterations are needed (i.e. the second parameter to generate_series). Sadly I couldn't think of a better way to get the number of intervals required other than calling generate_series for the dates and using the count of that. Of course if you know your intervals will only ever be certain values then it may be possible to optimise; however this version handles any interval values.
On my system it's about 50% slower than a pure generate_series, and about 400% faster than your plpgsql version.