sqlamazon-athenaprestotrino

Amazon Athena - SQL Query to Return all rows for ID where one row meets a condition and does not meet a condition


I am trying to write a query to return ALL rows for an ID where a condition is met and a condition is not met for each ID on the Order table.

The conditions I want are to return all rows where the ID's where order was packed, and has not shipped yet.

status = 'order_packed'  
status <> 'order_shipped'

Order Table

ID status
1 order_received
1 order_processed
1 order_packed
1 order_shipped
2 order_received
2 order_processed
2 order_packed
2 order_canceled
3 order_received
3 order_processed
3 order_packed
4 order_received
4 order_processed
4 order_packed
4 order_hold
4 order_shipped
5 order_received
5 order_processed
6 order_received
6 order_processed
6 order_hold

I am expecting the output to be

Output

ID status
2 order_received
2 order_processed
2 order_packed
2 order_canceled
3 order_received
3 order_processed
3 order_packed

I have tried the below:

    SELECT
        id
        ,status

    FROM order_table
    WHERE
    id IN (SELECT DISTINCT id FROM order_table WHERE status IN ('order_packed')) 
    AND id IN (SELECT DISTINCT id FROM order_table WHERE status <> 'order_shipped') 

Any help is appreciated. Thanks


Solution

  • One of the options is to use a single subquery which will group the data by id and perform filtering via the having clause:

    SELECT
        id
        , status
    FROM order_table
    WHERE id IN (
        select id
        from order_table
        group by id
        having count_if(status = 'order_packed') > 0 -- exists packed
           AND count_if(status = 'order_shipped') = 0); -- no shipped
    

    Output:

    id status
    2 order_received
    2 order_processed
    2 order_packed
    2 order_canceled
    3 order_received
    3 order_processed
    3 order_packed

    Note that this highly likely will need to scan your data twice (use explain to check the plan). If you are ok with grouping the results into array you can do something like the following:

    SELECT
        id
        , statuses
    FROM (select id, array_agg(status) statuses
          from order_table
          group by id)
    where contains (statuses, 'order_packed') AND NOT contains(statuses, 'order_shipped')
    

    Output:

    id statuses
    3 [order_received, order_processed, order_packed]
    2 [order_received, order_processed, order_packed, order_canceled]