I have transaction data with three columns: user_account, transaction_id, and transaction_date. I'd like to compute a rolling cumulative count of distinct transaction_id values for each user_account over a 24-hour window. An example is shown below.
user_account | transaction_date | transaction_id | cumulative_distinct_count |
---|---|---|---|
X0119989 | 2024-04-03 14:03:46 | G0000006 | 1 |
X0119989 | 2024-04-22 22:35:16 | G0000005 | 1 |
X0119989 | 2024-04-22 22:56:43 | G0000004 | 2 |
X0119989 | 2024-04-25 20:24:36 | G0000003 | 1 |
X0119989 | 2024-04-25 21:02:54 | G0000002 | 2 |
X0119989 | 2024-04-25 21:52:13 | G0000001 | 3 |
X0119999 | 2024-04-01 22:44:05 | G0000012 | 1 |
X0119999 | 2024-04-01 22:46:00 | G0000011 | 2 |
X0119999 | 2024-04-01 22:54:21 | G0000010 | 3 |
X0119999 | 2024-04-01 22:59:33 | G0000009 | 4 |
X0119999 | 2024-04-01 23:07:46 | G0000008 | 5 |
X0119999 | 2024-04-02 00:02:20 | G0000007 | 6 |
In the table above, transaction ID "G0000006" in the first row has a cumulative_distinct_count of 1 because no transaction other than itself occurred in the 24 hours up to "2024-04-03 14:03:46". Transaction ID "G0000004" in the third row has 2 because two distinct transactions, "G0000004" and "G0000005", occurred in the 24 hours up to "2024-04-22 22:56:43". What I'd like to do is compute the "cumulative_distinct_count" column.
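For reproducibility, the sample data above can be built like this:

import pandas as pd

data = pd.DataFrame({
    'user_account': ['X0119989'] * 6 + ['X0119999'] * 6,
    'transaction_date': pd.to_datetime([
        '2024-04-03 14:03:46', '2024-04-22 22:35:16', '2024-04-22 22:56:43',
        '2024-04-25 20:24:36', '2024-04-25 21:02:54', '2024-04-25 21:52:13',
        '2024-04-01 22:44:05', '2024-04-01 22:46:00', '2024-04-01 22:54:21',
        '2024-04-01 22:59:33', '2024-04-01 23:07:46', '2024-04-02 00:02:20',
    ]),
    'transaction_id': [
        'G0000006', 'G0000005', 'G0000004', 'G0000003', 'G0000002', 'G0000001',
        'G0000012', 'G0000011', 'G0000010', 'G0000009', 'G0000008', 'G0000007',
    ],
})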
I currently have a working but slow solution using pandas' apply (via swifter). The code is shown below.
import datetime as dt
import swifter  # registers the .swifter accessor on DataFrames

def count_unique_id(x):
    # count distinct IDs for the same user within the 24 hours up to this row
    condition = (data['transaction_date'].between(x['transaction_date'] - dt.timedelta(days=1), x['transaction_date'])) & (data['user_account'] == x['user_account'])
    return data[condition]['transaction_id'].nunique()

data['count_unique_id'] = data.swifter.apply(count_unique_id, axis=1)
The solution above takes too long because the data has over 3 million rows and every row rescans the whole frame. I'm looking for a faster solution.
Thanks in advance.
There is no `rolling.nunique`, so you would have to use a `groupby.rolling.apply`. Since rolling is only supported on numeric data, you can first `factorize` your IDs.
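As a quick illustration of what `factorize` does (toy values, not part of the solution):

import pandas as pd

codes, uniques = pd.factorize(pd.Series(['G0000006', 'G0000005', 'G0000006']))
# codes   -> array([0, 1, 0]): one integer label per ID, in order of first appearance
# uniques -> Index(['G0000006', 'G0000005'], dtype='object')

Counting distinct integer codes in a window is then equivalent to counting distinct IDs.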
# ensure datetime dtype
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
# we must sort:
# - by user_account, since we'll later assign the result back as a plain array
# - by transaction_date, since rolling on a datetime column expects sorted timestamps
df.sort_values(by=['user_account', 'transaction_date'], inplace=True)
df['cumulative_distinct_count'] = (
    df.assign(num_id=lambda d: pd.factorize(d['transaction_id'])[0])
      .groupby('user_account', sort=False)
      .rolling('24h', on='transaction_date')['num_id']
      .apply(lambda x: x.nunique())
      .values
)
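The `.values` at the end strips the `(user_account, original index)` MultiIndex that `groupby.rolling` produces, so the result is assigned back positionally; this is why the frame must be sorted by group first.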
Output:
user_account transaction_date transaction_id cumulative_distinct_count
0 X0119989 2024-04-03 14:03:46 G0000006 1.0
1 X0119989 2024-04-22 22:35:16 G0000005 1.0
2 X0119989 2024-04-22 22:56:43 G0000004 2.0
3 X0119989 2024-04-25 20:24:36 G0000003 1.0
4 X0119989 2024-04-25 21:02:54 G0000002 2.0
5 X0119989 2024-04-25 21:52:13 G0000001 3.0
6 X0119999 2024-04-01 22:44:05 G0000012 1.0
7 X0119999 2024-04-01 22:46:00 G0000011 2.0
8 X0119999 2024-04-01 22:54:21 G0000010 3.0
9 X0119999 2024-04-01 22:59:33 G0000009 4.0
10 X0119999 2024-04-01 23:07:46 G0000008 5.0
11 X0119999 2024-04-02 00:02:20 G0000007 6.0
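Note that `rolling.apply` returns floats, which is why the counts show as `1.0`, `2.0`, etc. If you prefer integers, cast the column afterwards:

df['cumulative_distinct_count'] = df['cumulative_distinct_count'].astype(int)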