sqlamazon-redshiftaginity

RFM Analysis - Value keeps changing


Got the following query going.

SELECT
    customer_id,
    NTILE(5) OVER (ORDER BY MAX(oms_order_date)) AS r_score
FROM 
    mdwh.us_raw.l_dmw_order_report
WHERE 
    quantity_ordered > 0
    AND customer_id IS NOT NULL
    AND customer_id != ('')
    AND UPPER(line_status) NOT IN ('','RETURN', 'CANCELLED')
    AND UPPER(item_description_1) NOT IN ('','FREIGHT', 'RETURN LABEL FEE', 'VISIBLE STITCH')
    AND (quantity_ordered * unit_price_amount) > 0
    AND extended_amount < 1000 --NO BULK ORDERS
    AND oms_order_date BETWEEN '2020-01-01' AND '2020-01-01'
    AND SUBSTRING(upc,1,6) IN (SELECT item_code FROM item_master_zs WHERE new_division BETWEEN '11' AND '39')
GROUP BY
    customer_id
ORDER BY
    customer_id

All I'm doing here is, given some conditions, to give me the unique customer ID, then cluster their latest purchase date into quintiles and provide me with a score in the second column. But everytime I run the query, the r_score value keeps changing? What am I doing wrong..? Here's a snippet of what the table looks like (again, r_score value keeps changing):

enter image description here


Solution

  • The problem with ntile() is that it ensures that the groups are exactly the same size by putting the same value across different groups.

    For this reason, I usually do the calculation manually, using rank():

    ceil(rank() over (order by max(oms_order_date)) * 5.0 /
         count(*) over ()
        ) as r_score
    

    If you use row_number(), you'll get the equivalent of ntile().

    If you want to use ntile(), you can use additional order by keys so the sorting keys are unique.

    ===================

    2/17/20 5:18PM EDIT

    Here is the new code I'm using:

    SELECT
        customer_id,
        CEIL(RANK() OVER (ORDER BY MAX(oms_order_date)) * 5 / COUNT(*) OVER ()) AS r_score,
        CEIL(RANK() OVER (ORDER BY COUNT(client_web_order_number)) * 5 / COUNT(*) OVER ()) AS f_score,
        CEIL(RANK() OVER (ORDER BY AVG(extended_amount)) * 5 / COUNT(*) OVER ()) AS m_score,
        (r_score || f_score || m_score) AS rfm_score
    FROM 
        mdwh.us_raw.l_dmw_order_report t1
    WHERE 
        quantity_ordered > 0
        AND customer_id IS NOT NULL
        AND customer_id != ('')
        AND oms_order_date IS NOT NULL
        AND UPPER(line_status) NOT IN ('','RETURN', 'CANCELLED')
        AND UPPER(item_description_1) NOT IN ('','FREIGHT', 'RETURN LABEL FEE', 'VISIBLE STITCH')
        AND (quantity_ordered * unit_price_amount) > 0
        AND extended_amount < 1000 --NO BULK ORDERS
        AND oms_order_date BETWEEN '2020-01-01' AND '2020-01-10'
        AND SUBSTRING(upc,1,6) IN (SELECT item_code FROM item_master_zs WHERE new_division BETWEEN '11' AND '39')
    GROUP BY
        customer_id
    ORDER BY
        customer_id
    

    The issue now is that I'm getting some row's with a blank r_score, and the max value is 4 instead of 5..