pandas, group-by, apply, timedelta, cumsum

How to use pandas to compute a rolling cumulative distinct count over the past 24 hours?


I have transaction data with three columns: user_account, transaction_date and transaction_id. I'd like to compute a rolling cumulative count of distinct transaction_id values per user_account over the past 24 hours. An example is shown below.

data

user_account transaction_date transaction_id cumulative_distinct_count
X0119989 2024-04-03 14:03:46 G0000006 1
X0119989 2024-04-22 22:35:16 G0000005 1
X0119989 2024-04-22 22:56:43 G0000004 2
X0119989 2024-04-25 20:24:36 G0000003 1
X0119989 2024-04-25 21:02:54 G0000002 2
X0119989 2024-04-25 21:52:13 G0000001 3
X0119999 2024-04-01 22:44:05 G0000012 1
X0119999 2024-04-01 22:46:00 G0000011 2
X0119999 2024-04-01 22:54:21 G0000010 3
X0119999 2024-04-01 22:59:33 G0000009 4
X0119999 2024-04-01 23:07:46 G0000008 5
X0119999 2024-04-02 00:02:20 G0000007 6

In the table above, transaction id "G0000006" in the first row has 1 in the "cumulative_distinct_count" column because no transaction other than itself happened in the 24 hours before "2024-04-03 14:03:46". Transaction id "G0000004" in the third row has 2 because two distinct transactions, "G0000004" and "G0000005", happened in the 24 hours before "2024-04-22 22:56:43". What I'd like to do is compute the "cumulative_distinct_count" column.
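
For reference, the sample above can be rebuilt with something like the following (a minimal sketch; the values are copied from the table):

import pandas as pd

data = pd.DataFrame({
    'user_account': ['X0119989'] * 6 + ['X0119999'] * 6,
    'transaction_date': pd.to_datetime([
        '2024-04-03 14:03:46', '2024-04-22 22:35:16', '2024-04-22 22:56:43',
        '2024-04-25 20:24:36', '2024-04-25 21:02:54', '2024-04-25 21:52:13',
        '2024-04-01 22:44:05', '2024-04-01 22:46:00', '2024-04-01 22:54:21',
        '2024-04-01 22:59:33', '2024-04-01 23:07:46', '2024-04-02 00:02:20',
    ]),
    'transaction_id': ['G0000006', 'G0000005', 'G0000004', 'G0000003',
                       'G0000002', 'G0000001', 'G0000012', 'G0000011',
                       'G0000010', 'G0000009', 'G0000008', 'G0000007'],
})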

I currently have a working but slow solution using the apply method of pandas (parallelized with swifter). The code is shown below.

import datetime as dt
import swifter  # registers the .swifter accessor on DataFrames

def count_unique_id(x):
    # distinct transaction ids for the same user in the 24 hours up to this transaction
    condition = (data['transaction_date'].between(x['transaction_date'] - dt.timedelta(days=1), x['transaction_date'])) & (data['user_account'] == x['user_account'])
    return data[condition]['transaction_id'].nunique()


data['count_unique_id'] = data.swifter.apply(count_unique_id, axis=1)

The solution above takes too long because my data has over 3 million rows. I'm looking for a faster approach.

Thanks in advance.


Solution

  • There is no rolling.nunique, so you would have to use groupby.rolling.apply. Since rolling is only supported on numeric data, you can first factorize your IDs (see also the notes after the output below for an integer cast and a faster alternative).

    # ensure datetime
    df['transaction_date'] = pd.to_datetime(df['transaction_date'])
    
    # we must sort:
    # - by group, since we'll later reassign the result via .values
    # - by transaction_date, since rolling on a datetime column expects sorted timestamps
    df.sort_values(by=['user_account', 'transaction_date'], inplace=True)
    
    df['cumulative_distinct_count'] = (df
       .assign(num_id=lambda d: pd.factorize(d['transaction_id'])[0])
       .groupby('user_account', sort=False)
       .rolling('24h', on='transaction_date')['num_id']
       .apply(lambda x: x.nunique()).values
    )
    

    Output:

       user_account    transaction_date transaction_id  cumulative_distinct_count
    0      X0119989 2024-04-03 14:03:46       G0000006                        1.0
    1      X0119989 2024-04-22 22:35:16       G0000005                        1.0
    2      X0119989 2024-04-22 22:56:43       G0000004                        2.0
    3      X0119989 2024-04-25 20:24:36       G0000003                        1.0
    4      X0119989 2024-04-25 21:02:54       G0000002                        2.0
    5      X0119989 2024-04-25 21:52:13       G0000001                        3.0
    6      X0119999 2024-04-01 22:44:05       G0000012                        1.0
    7      X0119999 2024-04-01 22:46:00       G0000011                        2.0
    8      X0119999 2024-04-01 22:54:21       G0000010                        3.0
    9      X0119999 2024-04-01 22:59:33       G0000009                        4.0
    10     X0119999 2024-04-01 23:07:46       G0000008                        5.0
    11     X0119999 2024-04-02 00:02:20       G0000007                        6.0
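
    Note that rolling.apply returns floats, which is why the counts show as 1.0, 2.0, and so on; if you prefer integers, a final cast is safe since every window contains at least the current row:

    df['cumulative_distinct_count'] = df['cumulative_distinct_count'].astype(int)

    If groupby.rolling.apply is still too slow at 3 million rows, a two-pointer sliding window per user avoids re-scanning past rows for every transaction and runs in roughly linear time per group. A minimal sketch, not part of the answer above, assuming the frame is already sorted as above and that the window is inclusive of exactly 24 hours (matching the between call in the question):

    import numpy as np
    from collections import Counter

    def window_nunique(group):
        # two pointers over one user's time-sorted transactions:
        # `left` trails `right`, evicting ids older than 24 hours
        times = group['transaction_date'].to_numpy()
        ids = group['transaction_id'].to_numpy()
        window = np.timedelta64(24, 'h')
        counts = Counter()  # id -> occurrences currently inside the window
        out = np.empty(len(group), dtype=np.int64)
        left = 0
        for right in range(len(group)):
            counts[ids[right]] += 1
            while times[right] - times[left] > window:
                counts[ids[left]] -= 1
                if counts[ids[left]] == 0:
                    del counts[ids[left]]
                left += 1
            out[right] = len(counts)  # distinct ids in the current window
        return pd.Series(out, index=group.index)

    df['cumulative_distinct_count'] = (df
       .groupby('user_account', group_keys=False)[['transaction_date', 'transaction_id']]
       .apply(window_nunique)
    )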