python pandas featuretools

Create shared unique integer keys in 3 dataframes for rows with the same names, to generate automatic features using featuretools


I have three different data frames with basketball players' data.

All three dataframes contain the players' names. I want to combine them into one EntitySet so that I can use automated feature generation with featuretools.

As I understand it, I need to create an integer key in each of the three dataframes and use it to join them. The same player must get the same id in every dataframe.

How can I create unique integer keys for 3 different datasets, ensuring that the same players have the same ids?


Solution

  • You do not need to create an integer key to create the relationships. If your names are unique, you can use them directly when defining the relationships.

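    import pandas as pd
    import featuretools as ft
    
    # Parent dataframe: one row per player, so "name" is unique here.
    players = pd.DataFrame({
        "name": ["John", "Jane", "Bill"],
        "date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
        "other_data": [100, 200, 300]
    })
    # Child dataframe: one row per game, with "player" referring to players["name"].
    scores = pd.DataFrame({
        "game_id": [0, 1, 2],
        "player": ["John", "John", "Jane"],
        "score": [24, 17, 29]
    })
    
    es = ft.EntitySet()
    # The index column of each dataframe must be unique; strings work as well as integers.
    es.add_dataframe(dataframe_name="players", dataframe=players, index="name")
    es.add_dataframe(dataframe_name="scores", dataframe=scores, index="game_id")
    # Arguments: parent dataframe, parent column, child dataframe, child column.
    es.add_relationship("players", "name", "scores", "player")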
    

    If your player names are not unique, then you won't be able to create a unique integer id from the names alone. You would have to combine the name with some other piece of information (something like team) to create a new column in your dataframe that uniquely identifies the player in all of your dataframes.
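If you still want integer ids, or need the combined key described above, the following is a minimal pandas sketch of one way to do it. The three dataframes, the `team` column, and the `player_key` and `player_id` column names are made up for illustration, and the sketch assumes that name plus team really does identify a player in every dataframe. It builds the combined string key in each dataframe, then numbers the distinct keys once across all three so that the same player gets the same id everywhere.

```python
import pandas as pd

# Hypothetical data: "John" appears on two teams, so the name alone is ambiguous.
players = pd.DataFrame({
    "name": ["John", "John", "Jane"],
    "team": ["Bulls", "Lakers", "Bulls"],
    "height_cm": [198, 203, 185],
})
scores = pd.DataFrame({
    "game_id": [0, 1, 2, 3],
    "name": ["John", "John", "Jane", "John"],
    "team": ["Bulls", "Lakers", "Bulls", "Bulls"],
    "score": [24, 17, 29, 11],
})
salaries = pd.DataFrame({
    "name": ["Jane", "John"],
    "team": ["Bulls", "Lakers"],
    "salary": [1_000_000, 2_000_000],
})

frames = [players, scores, salaries]

# Step 1: combine name and team into one string key per row.
# Assumes "_" does not occur in a way that makes two different players collide.
for df in frames:
    df["player_key"] = df["name"] + "_" + df["team"]

# Step 2: number the distinct keys once, over all dataframes together,
# so the mapping from key to id is shared.
all_keys = pd.concat([df["player_key"] for df in frames], ignore_index=True)
_, unique_keys = pd.factorize(all_keys)
key_to_id = {key: i for i, key in enumerate(unique_keys)}

# Step 3: apply the shared mapping to every dataframe.
for df in frames:
    df["player_id"] = df["player_key"].map(key_to_id)

print(players[["player_key", "player_id"]])
print(scores[["game_id", "player_key", "player_id"]])
print(salaries[["player_key", "player_id"]])
```

With ids like these, `players` could be added to the EntitySet with `index="player_id"`, and the other dataframes related to it through their own `player_id` columns in the same way as in the example above. The string `player_key` would work as the index just as well, so the integer step is optional.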