sqlamazon-athenaprestotrino

Issue with using contains array function in SQL presto


trying to get the following SQL Presto code to work so that I can perform business day calculations. It works if I write "d -> day_of_week(d) not in (6,7)" but I also need to filter out holidays from a calendar hence the "and contains(array_agg(date), d)" clause which does not work. I have also tried contains(date, d) but this failed as well. Please advise.

WITH dataset(start_date, end_date) AS (
 values     (date '2021-10-01', date '2021-10-05'),
    (date '2021-10-01', date '2021-10-03'),
    (date '2021-10-02', date '2021-10-10'),
    (date '2021-10-02', date '2021-10-08'),
    (date '2021-10-02', date '2021-10-05')
),
holidays (row, date) AS (
    values (1,date '2021-10-08'))

select start_date,
       end_date,
       cardinality(filter(
               sequence(start_date, end_date, interval '1' day),
                d -> day_of_week(d) not in (6,7) and not contains(array_agg(date), d) 
           )) business_days
from dataset, holidays

Solution

  • You can flatten the data, filter and then aggregate it back:

    -- sample data
    WITH dataset(start_date, end_date) AS (
     values     (date '2021-10-01', date '2021-10-05'),
        (date '2021-10-01', date '2021-10-03'),
        (date '2021-10-02', date '2021-10-10'),
        (date '2021-10-02', date '2021-10-08'),
        (date '2021-10-02', date '2021-10-05')
    ),
    holidays (row, date) AS (
        values (1,date '2021-10-08')),
    -- query
    with_ids as (
        select start_date,
               end_date,
               row_number() over () rn
        from dataset
    )
    
    select arbitrary(start_date) start_date, 
        arbitrary(end_date) end_date, 
        array_agg(day) business_days
    from with_ids, 
        unnest(sequence(start_date, end_date, interval '1' day)) as t(day)
    where day not in (select date from holidays)
    group by rn;
    

    Output:

    start_date end_date business_days
    2021-10-01 2021-10-05 [2021-10-01, 2021-10-02, 2021-10-03, 2021-10-04, 2021-10-05]
    2021-10-02 2021-10-10 [2021-10-02, 2021-10-03, 2021-10-04, 2021-10-05, 2021-10-06, 2021-10-07, 2021-10-09, 2021-10-10]
    2021-10-02 2021-10-08 [2021-10-02, 2021-10-03, 2021-10-04, 2021-10-05, 2021-10-06, 2021-10-07]
    2021-10-02 2021-10-05 [2021-10-02, 2021-10-03, 2021-10-04, 2021-10-05]
    2021-10-01 2021-10-03 [2021-10-01, 2021-10-02, 2021-10-03]

    The only problem here is that this will result in deletion of ranges without business days (you can bring them back by left joining the original data).