pythonpandassequencehistorical-db

How can I assign chronological order position or each customers purchase history using Python


I have a data set of all customers purchases made in the last 5 years with the below columns as an example.

CUSTOMER ID | ORDER NUMBER | ORDER DATE (DateTime) | ITEM ID

What I need to do is assign each individual customer order position (i.e 1,2,3,4) - So for each customer I am able to know which order was their 1st 2nd or 3rd using the above criteria

Things to consider:

  1. There are multiple Customer ID's and Order Numbers in the same table
  2. There are multiple rows from the same customer & order number combination as each order has multiple items, so if the first order for a customer has 3 items on it I want all 3 items to show as 1.

Struggling to find a starting point of how to do this

Below is the data table:

import pandas as pd
df = pd.DataFrame({'Customer ID' : ['C100003','C100002','C100002','C100002','C100003','C100002'],
                    'Order Number' : ['RJSJ0053','RJSJ0060','RJSJ0085','RJSJ0085','RJSJ0089','RJSJ0092'],
                    'Order Date' : ['2023-05-08','2023-06-09','2023-06-13','2023-06-13','2023-06-13','2023-06-14'],
                    'Item ID' : ['Mouse','Keyboard','Computer','Monitor','Keyboard','Headset'],
                    'Order Position' : [1,1,2,2,2,3]})

Solution

  • First, remove all duplicate [Customer ID, Order Number]. This will get you a table that looks like this:

      Customer ID Order Number  Order Date   Item ID
    0     C100003     RJSJ0053  2023-05-08     Mouse
    1     C100002     RJSJ0060  2023-06-09  Keyboard
    2     C100002     RJSJ0085  2023-06-13  Computer
    4     C100003     RJSJ0089  2023-06-13  Keyboard
    5     C100002     RJSJ0092  2023-06-14   Headset
    

    Now we can sort this by [Customer ID, Order Date], create a cumulative count grouped by Customer ID, then fill forward the missing values since those are multiple items in the same order.

    df['Order Position'] = (df.drop_duplicates(['Customer ID', 'Order Number'])
                              .sort_values(['Customer ID', 'Order Date'])
                              .groupby('Customer ID')
                              .cumcount()+1
                           )
    
    df['Order Position'] = (df.sort_values(['Customer ID', 'Order Date'])['Order Position']
                              .fillna(method='ffill')
                              .astype(int)
                           )
    
      Customer ID Order Number  Order Date   Item ID  Order Position
    0     C100003     RJSJ0053  2023-05-08     Mouse               1
    1     C100002     RJSJ0060  2023-06-09  Keyboard               1
    2     C100002     RJSJ0085  2023-06-13  Computer               2
    3     C100002     RJSJ0085  2023-06-13   Monitor               2
    4     C100003     RJSJ0089  2023-06-13  Keyboard               2
    5     C100002     RJSJ0092  2023-06-14   Headset               3