amazon-athena

Different results, when using FILTER WHERE vs. WHERE in AWS Athena


I have a simple query that I'm trying to debug as it's giving me incorrect results:

WITH a AS (
    SELECT
        id,
        reference_date,
        date,
        COALESCE(failed_sum, 0) AS failed_sum,
        COALESCE(succ_sum, 0) AS succ_sum,
        COALESCE(failed_sum, 0) + COALESCE(succ_sum, 0) AS total_sum
    FROM transactions
    WHERE id = 'ABC'
)
SELECT id,
       SUM(failed_sum) / SUM(total_sum) FILTER (WHERE date >= DATE_ADD('day', -30, reference_date) AND date < reference_date) AS ratio
FROM a
GROUP BY 1;

This produced the value of 6, whereas I expected it to be <1 as it's a ratio of failed over total.

However, when I change the query like so:

WITH a AS (
    SELECT
        id,
        reference_date,
        date,
        COALESCE(failed_sum, 0) AS failed_sum,
        COALESCE(succ_sum, 0) AS succ_sum,
        COALESCE(failed_sum, 0) + COALESCE(succ_sum, 0) AS total_sum
    FROM transactions
    WHERE id = 'ABC'
)
SELECT id,
       SUM(failed_sum) / SUM(total_sum) AS ratio
FROM a
WHERE date >= DATE_ADD('day', -30, reference_date) AND date < reference_date
GROUP BY 1;

I get the expected result of 0.5. I expected the two queries to be identical, but I do see some differences in the execution plans between the two. Am I missing something obvious here?

Thanks in advance!


Solution

  • The difference here is that your first sum expression:

    SUM(failed_sum) / SUM(total_sum) FILTER (WHERE date >= DATE_ADD('day', -30, reference_date) AND date < reference_date) AS ratio
    

    will include all records, per id, across the entire table. The denominator of the expression will only sum total_sum values where the date has a certain value.

    On the other hand, the second query:

    SELECT id,
           SUM(failed_sum) / SUM(total_sum) AS ratio
    FROM a
    WHERE date >= DATE_ADD('day', -30, reference_date) AND date < reference_date
    GROUP BY 1;
    

    will actually remove records not matching the date requirement. This means the first SUM(failed_sum) will probably be less for many id groups, as some of the records would be dropped before the sum happens. The denominator values might be the same, but that won't matter because the quotients won't be in agreement with the first query.

    As to which version you want, it is up to your logic.