Tags: python, pandas, feature-extraction, featuretools

How can I use get_valid_primitives when I have only one dataframe in Featuretools?


I am trying to figure out how Featuretools works, and I am testing it on the Housing Prices dataset from Kaggle. Because the dataset is huge, I'll work here with only a subset of it.

The dataframe is:

train={'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60}, 'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'}, 'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0}, 'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260}}

I create an EntitySet for this dataframe:

es_train = ft.EntitySet()

I add the dataframe to the created EntitySet:

es_train.add_dataframe(dataframe_name='train', dataframe=train, index='Id')

Then I call the function:

ap, tp = ft.get_valid_primitives(entityset=es_train, target_dataframe_name='train')

And here it all breaks down, because I get the following error message:

KeyError: 'DataFrame train does not exist in entity set'

I studied the tutorials on the Featuretools site, but all I could find were tutorials with multiple dataframes, so they didn't help me.

Where am I going wrong? How can I correct the mistake(s)?

Thanks!

Later edit: I am using PyCharm. When I work in script mode, I get the error above. However, when I use the command line, everything works perfectly.


Solution

  • The only issue I see with your code is that you're not wrapping your train object in pd.DataFrame, so add_dataframe receives a plain dict instead of a DataFrame.

    This code works well for me:

    import featuretools as ft
    import pandas as pd
    
    train=pd.DataFrame({
        'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 
        'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60}, 
        'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'}, 
        'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0}, 
        'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260}
    })
    
    es_train = ft.EntitySet()
    es_train.add_dataframe(dataframe_name='train', dataframe=train, index='Id')
    
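    # get_valid_primitives returns (aggregation primitives, transform primitives);
    # only the transform primitives are kept here.
    _, tp = ft.get_valid_primitives(entityset=es_train, target_dataframe_name='train')
    
    # Print the name of each valid transform primitive.
    for p in tp:
        print(p.name)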
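
    Before handing the data to add_dataframe, a quick sanity check can confirm that the object really is a DataFrame and that the intended index column is unique. The sketch below uses only pandas; the name train_raw is introduced here for illustration and is not part of the original code.

```python
import pandas as pd

# The raw data as given in the question: a plain nested dict, not a DataFrame.
train_raw = {
    'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60},
    'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'},
    'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0},
    'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260},
}

# Outer keys become columns, inner keys become the row index.
train_df = pd.DataFrame(train_raw)

# Checks worth running before building the EntitySet.
print(isinstance(train_raw, pd.DataFrame))  # the dict is not a DataFrame
print(isinstance(train_df, pd.DataFrame))   # the wrapped object is
print(train_df['Id'].is_unique)             # 'Id' can serve as a unique index
print(train_df.dtypes)
```

    If the first check prints False for the object you are about to pass in, wrap it in pd.DataFrame first, as in the code above.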