oracle-databasesnowflake-cloud-data-platform

snowflake select max date from date array


Imagine I have a table with some field one of which is array off date. as below

col1 col2 alldate                       Max_date
1    2    ["2021-02-12","2021-02-13"]   "2021-02-13"
2    3    ["2021-01-12","2021-02-13"]   "2021-02-13"
4    4    ["2021-01-12"]                "2021-01-12"
5    3    ["2021-01-11","2021-02-12"]   "2021-02-12"
6    7    ["2021-02-13"]                "2021-02-13"

I need to write a query such that to select only the one which has max date in there array. And there is a column which has max date as well.

Like the select statement should show

col1 col2 alldate                       Max_date
1    2    ["2021-02-12","2021-02-13"]   "2021-02-13"
2    3    ["2021-01-12","2021-02-13"]   "2021-02-13"
6    7    ["2021-02-13"]                "2021-02-13"

The table is huge so a optimized query is needed. Till now I was thinking of

select col1, col2, maxdate
       from t1 where array_contains((select max(max_date) from t1)::variant,date));

But to me it seems running select statement per query is a bad idea.

Any Suggestion


Solution

  • If you want pure speed using lateral flatten is 10% faster than the array_contains approach over 500,000,000 records on a XS warehouse. You can copy paste below code straight into snowflake to test for yourself.

    Why is the lateral flattern approach faster? Well if you look at the query plans the optimiser filters at first step (immediately culling records) where as the array_contains waits until the 4th step before doing the same. The filter is the qualifier of the max(max_date) ...

    Create Random Dataset:

    create or replace table stack_overflow_68132958 as
    SELECT
        seq4() col_1,
        UNIFORM (1, 500, random()) col_2,
        DATEADD(day, UNIFORM (-40, -0, random()), current_date()) random_date_1,
        DATEADD(day, UNIFORM (-40, -0, random()), current_date()) random_date_2,
        DATEADD(day, UNIFORM (-40, -0, random()), current_date()) random_date_3,
        ARRAY_CONSTRUCT(random_date_1, random_date_2, random_date_3) date_array,
        greatest(random_date_1, random_date_2, random_date_3) max_date,
        to_array(greatest(random_date_1, random_date_2, random_date_3)) max_date_array
    FROM
    TABLE (GENERATOR (ROWCOUNT => 500000000))  ;
    

    Test Felipe/Mike approach -> 51secs

    select 
     distinct 
     col_1
    ,col_2  
    from 
      stack_overflow_68132958 
    qualify 
     array_contains(max(max_date) over () :: variant, date_array); 
    

    Test Adrian approach -> 47 secs

    select 
      distinct    
      col_1
    , col_2 
    from 
       stack_overflow_68132958
     , lateral flatten(input => date_array) g 
    qualify 
        max(max_date) over () = g.value;
    

    enter image description here

    enter image description here