featuretools

Calculate time-windowed profiles with featuretools DFS


I am having trouble understanding the cutoff_dates concept. What I am really looking for is a way to calculate different features over a time window of, say, the previous 60 days (excluding the current transaction). In the examples, the cutoff dates look like hard-coded dates. I am using a time index for each row (A_time below), and according to the docs (what_is_cutoff_datetime):

The time index is defined as the first time that any information from a row can be used. If a cutoff time is specified when calculating features, rows that have a later value for the time index are automatically ignored.

So it is not clear to me whether, if I don't pass a cutoff date, each feature will be calculated only up to the row's time index value.

Here is my entityset definition:

es = ft.EntitySet('payment')
es = es.entity_from_dataframe(entity_id='tableA',
                           dataframe=tableA_dfpd,
                           index='paymentIndex',
                           time_index='A_time')


es.normalize_entity(base_entity_id='tableA',
               new_entity_id='tableB',
               index='B_index',
               additional_variables=['B_x','B_time'],
               make_time_index='B_time')
               
es.normalize_entity(base_entity_id='tableA',
               new_entity_id='tableC',
               index='C_index',
               additional_variables=["C_x","C_date"],
               make_time_index="C_date")

es.normalize_entity(base_entity_id='tableA',
               new_entity_id='tableD',
               index='D_index',
               additional_variables=["D_x"],
               make_time_index=False)
               
Entityset: payment
  Entities:
    tableA [Rows: 310083, Columns: 8]
    tableB [Rows: 30296, Columns: 3]
    tableC [Rows: 206565, Columns: 3]
    tableD [Rows: 18493, Columns: 2]
  Relationships:
    tableA.B_index -> tableB.B_index
    tableA.C_index -> tableC.C_index
    tableA.D_index -> tableD.D_index

How exactly can I do the window calculation? Do I need to pass cutoff dates to the dfs method or not? I want all window calculations to be based on the A_time variable, over a 60-day window up to the current transaction. So the cutoff date for every transaction is effectively the A_time value of that transaction, isn't it?


Solution

  • Thanks for the question. You can calculate features based on a time window by using a training window in DFS. You can also exclude transactions at the cutoff times by setting include_cutoff_time=False. I'll use this dataset of transactions to go through an example.

    import featuretools as ft
    
    df = ft.demo.load_mock_customer(return_single_table=True)
    df = df[['transaction_id', 'transaction_time', 'customer_id', 'amount']]
    df = df.sort_values(['customer_id', 'transaction_time'])
    df.head()
    
     transaction_id    transaction_time  customer_id  amount
                290 2014-01-01 00:44:25            1   21.35
                275 2014-01-01 00:45:30            1  108.11
                101 2014-01-01 00:46:35            1  112.53
                 80 2014-01-01 00:47:40            1    6.29
                484 2014-01-01 00:48:45            1   47.95
    

    First, we create an entity set for transactions and customers.

    es = ft.EntitySet()
    
    es.entity_from_dataframe(
        entity_id='transactions',
        index='transaction_id',
        time_index='transaction_time',
        dataframe=df,
    )
    
    es.normalize_entity(
        base_entity_id='transactions',
        new_entity_id='customers',
        index='customer_id',
    )
    
    es.add_last_time_indexes()
    
    Entityset: None
      Entities:
        transactions [Rows: 500, Columns: 4]
        customers [Rows: 5, Columns: 2]
      Relationships:
        transactions.customer_id -> customers.customer_id
    

    Then, we create a cutoff time at each transaction for each customer.

    # Rename on a new frame instead of mutating a slice of df,
    # which avoids pandas' SettingWithCopyWarning.
    cutoff_time = df[['customer_id', 'transaction_time']].rename(
        columns={'transaction_time': 'time'}
    )
    cutoff_time.head()
    
     customer_id                time
               1 2014-01-01 00:44:25
               1 2014-01-01 00:45:30
               1 2014-01-01 00:46:35
               1 2014-01-01 00:47:40
               1 2014-01-01 00:48:45
    

    Now, we can run DFS using a training window to calculate features based on a time window. In this example, we'll set the training window to 1 hour. This will include all transactions within 1 hour before the cutoff time for each customer.

    By default, transactions at the cutoff times are also included in the calculation. We can exclude those transactions by setting include_cutoff_time=False.

    fm, fd = ft.dfs(
        target_entity='customers',
        entityset=es,
        cutoff_time=cutoff_time,
        include_cutoff_time=False,
        cutoff_time_in_index=True,
        training_window='1h',
        trans_primitives=[],
        agg_primitives=['sum'],
        verbose=True,
    )
    
    fm.sort_index().head() 
    
                                     SUM(transactions.amount)
    customer_id time                                         
    1           2014-01-01 00:44:25                      0.00
                2014-01-01 00:45:30                     21.35
                2014-01-01 00:46:35                    129.46
                2014-01-01 00:47:40                    241.99
                2014-01-01 00:48:45                    248.28
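
    As a sanity check on what the training window computes, the numbers above can be reproduced with plain pandas. This is only a rough sketch, not how featuretools does it internally: windowed_sum is a hypothetical helper, the five rows are copied from the df.head() output earlier, and the window boundaries (start included, cutoff excluded) are an assumption made to mimic include_cutoff_time=False.

```python
import pandas as pd

# First five transactions of customer 1, copied from the df.head() output.
df = pd.DataFrame({
    "transaction_id": [290, 275, 101, 80, 484],
    "transaction_time": pd.to_datetime([
        "2014-01-01 00:44:25",
        "2014-01-01 00:45:30",
        "2014-01-01 00:46:35",
        "2014-01-01 00:47:40",
        "2014-01-01 00:48:45",
    ]),
    "customer_id": [1, 1, 1, 1, 1],
    "amount": [21.35, 108.11, 112.53, 6.29, 47.95],
})


def windowed_sum(frame, window):
    """Hypothetical helper: for each transaction, sum the same customer's
    amounts in [cutoff - window, cutoff), where cutoff is that
    transaction's own time. The current transaction is excluded."""
    window = pd.Timedelta(window)
    sums = []
    for customer, cutoff in zip(frame["customer_id"], frame["transaction_time"]):
        in_window = (
            (frame["customer_id"] == customer)
            & (frame["transaction_time"] >= cutoff - window)  # assumed boundary
            & (frame["transaction_time"] < cutoff)            # exclude current row
        )
        # An empty selection sums to 0.0, like the first row of the DFS output.
        sums.append(frame.loc[in_window, "amount"].sum())
    return pd.Series(sums, index=frame.index).round(2)


df["sum_prev_1h"] = windowed_sum(df, "1h")
print(df[["transaction_time", "sum_prev_1h"]])
```

    The resulting column should match SUM(transactions.amount) in the feature matrix for these five cutoff times. The loop is quadratic in the number of rows, so it is only meant for checking a small sample, not as a replacement for DFS.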
    

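    If cutoff times are not passed to DFS, then all transactions for each customer are included in the calculation. For the 60-day window in your question, the same approach should apply: pass one cutoff time per transaction (the A_time value) and set the training window to 60 days, e.g. training_window='60 days'. Let me know if this helps.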