sqlsnowflake-cloud-data-platform

Why is Snowflake returning records where complex WHERE clause is FALSE?


I have this query in Snowflake (which runs on a generator so it doesn't need a table if you want to just run it):

SELECT
    DATEADD(day, SEQ4() * 7, '2022-11-13'::DATE) AS date_column,
    DATEADD(week, -98, CURRENT_DATE) as anchor_date,
    current_date,
    DATEADD(day, SEQ4() * 7, '2022-11-13'::DATE) BETWEEN DATEADD(week, -98, CURRENT_DATE) AND CURRENT_DATE as truthy
FROM
    TABLE(GENERATOR(ROWCOUNT => 5000))
WHERE
    DATEADD(day, SEQ4() * 7, '2022-11-13'::DATE) BETWEEN DATEADD(week, -98, CURRENT_DATE) AND CURRENT_DATE
order by date_column

Which returns these records (and more):

date_column anchor_date current_date truthy
2022-11-13 2022-12-13 2024-10-29 FALSE
2022-11-20 2022-12-13 2024-10-29 FALSE
2022-11-27 2022-12-13 2024-10-29 FALSE
2022-12-04 2022-12-13 2024-10-29 FALSE
2022-12-11 2022-12-13 2024-10-29 FALSE
2022-12-18 2022-12-13 2024-10-29 TRUE
2022-12-25 2022-12-13 2024-10-29 TRUE
2023-01-01 2022-12-13 2024-10-29 TRUE
2023-01-08 2022-12-13 2024-10-29 TRUE
2023-01-15 2022-12-13 2024-10-29 TRUE

Clearly it knows that the WHERE clause is FALSE for the first five records, but it returns those records anyway. Why? I'm guessing the table-free / generator setup may have something to do with it, but I can't figure out why.

I know I could wrap this in a SELECT * FROM () where truthy; - solving the problem isn't so much the issue as understanding why this doesn't work. What's going on?


Solution

  • so first to prove the SEQ4 are stable in the select as implied:

    SELECT
        SEQ4() * 7 as a,
        SEQ4() * 7 as b,
    FROM TABLE(GENERATOR(ROWCOUNT => 5000))
    WHERE A != b
    

    Query produced no results

    good

    now to use the results of the select region in the where:

    SELECT
        DATEADD(day, SEQ4() * 7, '2022-11-13'::DATE) AS date_column,
        DATEADD(week, -98, CURRENT_DATE) as anchor_date,
        current_date,
        date_column BETWEEN anchor_date AND CURRENT_DATE as truthy
    FROM TABLE(GENERATOR(ROWCOUNT => 5000))
    WHERE truthy
    order by date_column
    

    no unwanted results

    change back to using a seperate expression in the where clause (how you have it):

    SELECT
        DATEADD(day, SEQ4() * 7, '2022-11-13'::DATE) AS date_column,
        DATEADD(week, -98, CURRENT_DATE) as anchor_date,
        current_date,
        date_column BETWEEN anchor_date AND CURRENT_DATE as truthy
    FROM TABLE(GENERATOR(ROWCOUNT => 5000))
    WHERE DATEADD(day, SEQ4() * 7, '2022-11-13'::DATE) BETWEEN DATEADD(week, -98, CURRENT_DATE) AND CURRENT_DATE
    order by date_column
    

    unwanted results

    I am going to guess the independant code is the same as explicitly using a sub-select:

    SELECT
        DATEADD(day, SEQ4() * 7, '2022-11-13'::DATE) AS date_column,
        DATEADD(week, -98, CURRENT_DATE) as anchor_date,
        current_date,
        date_column BETWEEN anchor_date AND CURRENT_DATE as truthy
    FROM (
        select SEQ4() as sub_seq
        from TABLE(GENERATOR(ROWCOUNT => 5000))
    ) WHERE DATEADD(day, sub_seq * 7, '2022-11-13'::DATE) BETWEEN DATEADD(week, -98, CURRENT_DATE) AND CURRENT_DATE
    order by date_column
    

    which has unwanted results which can be seen like this:

    SELECT
        SEQ4() as a,
        a > 1000 as assumptions
    FROM (
        select SEQ4() as b
        from TABLE(GENERATOR(ROWCOUNT => 5000))
    ) 
    WHERE b > 1000
    order by 2;
    

    things can be indepentant

    Now if the where clause here was a != b there are no unwanted results, as the rows are only processed once.