Below is the example code from the official docs:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="customers",
    agg_primitives=["sum", "mode"],
    trans_primitives=["cum_max", "month", "cum_count"],
    max_depth=2
)
feature_defs
>>
[<Feature: zip_code>,
....
<Feature: MODE(sessions.device)>,
<Feature: MODE(transactions.sessions.device)>,
...
]
After analyzing the calculation with graph_feature(), it looks like MODE(sessions.device) and MODE(transactions.sessions.device) are the same even though they are calculated in different ways. If I'm right, why does dfs calculate this redundantly?
Thanks for the question! While they look similar, these are actually different features. MODE(sessions.device) is the mode of devices over all sessions for a customer, while MODE(transactions.sessions.device) is the mode of devices over all transactions for a customer.
As a quick example to demonstrate the difference, let's say a customer has 3 sessions:
session_id    device
------------------------
A             Mobile
B             PC
C             PC
There are also 5 transactions, each associated with one of these sessions:
transaction_id    session_id    sessions.device
--------------------------------------------------
0                 A             Mobile
1                 A             Mobile
2                 A             Mobile
3                 B             PC
4                 C             PC
In this case, MODE(sessions.device) would be PC, but MODE(transactions.sessions.device) would be Mobile because there are more transactions associated with Session A. In the feature graphs, the key difference is that MODE(transactions.sessions.device) first joins on the transactions entity. Even if you group by sessions, you won't end up with what you started with, since each transaction now has its own value.
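The toy example above can be checked by hand with a few lines of plain Python. This is only a sketch of the idea, not how featuretools computes the features internally. The `mode` helper and the dictionary layout are assumptions made for illustration.

```python
from collections import Counter

# Sessions for one customer: session_id -> device
sessions = {"A": "Mobile", "B": "PC", "C": "PC"}

# Transactions for the same customer: transaction_id -> session_id
transactions = {0: "A", 1: "A", 2: "A", 3: "B", 4: "C"}


def mode(values):
    """Return the most common value (hypothetical helper for this sketch)."""
    return Counter(values).most_common(1)[0][0]


# MODE(sessions.device): one device value per session
mode_sessions_device = mode(sessions.values())

# MODE(transactions.sessions.device): look up the session's device for
# every transaction first, so each transaction contributes one value
mode_transactions_sessions_device = mode(
    sessions[session_id] for session_id in transactions.values()
)

print(mode_sessions_device)               # PC
print(mode_transactions_sessions_device)  # Mobile
```

The two results differ because the second computation counts Session A's device three times, once per transaction, rather than once per session.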