pythonpandasdataframetruthtable

Is there a better way to transform a data frame into a "truth table" than looping through it?


I have a view in Redshift that I'm reading from to create a data frame. The table is structured like the following with about 49k records:

session_id timestamp event_text
session1 2020-07-07 06:45:45.012 event-A
session1 2020-07-10 04:19:07.477 event-B
session2 2020-07-10 16:42:24.46 event-B
session2 2020-07-10 18:57:12.358 event-C
session3 2020-07-10 16:42:24.46 event-A
session3 2020-07-10 18:57:12.358 event-C

My goal was to create a data frame structured like this:

session_id event-A event-B event-C
session1 1 1 0
session2 0 1 1
session3 1 0 1

I know this structure as a "truth table" but not sure what other people call it.

I found a way to do this in Python by looping through the query results, like this:

import pandas as pd
import pandas.io.sql as sqlio

# Redshift query
df = sqlio.read_sql_query(master_order_event_view, conn) 

events = df.event_text.unique()
unique_sessions = df.session_id.unique()

# Creating Dataframe with session IDs as index and event_text values as columns
truth_df = pd.DataFrame(0, index=unique_sessions, columns=events) 

for session_id, event_text in zip(df["session_id"], df["event_text"]):
    truth_df.at[session_id, event_text] = 1
    
return truth_df

My question is: is there a built-in Pandas command/library to do this? I looked around the docs a lot but I couldn't find anything that satisfied it. Perhaps calling it a "truth table" is my problem.

Whether or not there's a built-in way to do this, does anyone have any suggestions of how to make this more efficient? Restructuring the view, using a different library, etc.

Thanks!


Solution

  • Big shoutout to @Dani Mesejo and @sammywemmy who gave me some suggestions.

    for session_id, event_text in zip(df["session_id"], df["event_text"]):
        truth_df.at[session_id, event_text] = 1
    

    can be replaced by either of the following:

    Using crosstab:

    truth_df = pd.crosstab(df.session_id, df.event_text)
    

    or groupby:

    truth_df = df.groupby(["session_id", "event_text"]).size().unstack(fill_value=0)
    

    Both work great but according to @sammywemmy , groupby is a bit speedier.