pythonpandaspivot-table

Is there a way to update pandas.pivot_table without reconstructing it?


I have a student table that contains student_id, course_id, and exam_time (10k rows). I pivot on student_id and exam_time to get the number of exams in a session or in a day. I am building a timetabling heuristic that changes the times of the examinations one at a time so I need to update this pivot table a lot of times. A change in one course's examination time affects an average of 50 rows in the original dataframe. Is there a way to update the resulting pivot table without recalculating the whole thing in pandas or should I keep track of the changes on the pivot table myself (i.e. by adding and subtracting 1 to the changed slots)?

Edit: Here's how I construct the pivot table. I added a column of ones to count the numbers by np.sum. I couldn't find another function that works faster.

sLength = len(df["student_id"])
df["ones"] = pd.Series(np.ones(sLength))
pivot_table = pd.pivot_table(df, rows = "student_id", cols = "exam_time", values = "ones", aggfunc = np.sum)

And for the changes in examination times, I wrote this (assuming changed_course is moved from old_slot to new_slot)

affected_students = df[df["course_id"] == changed_course]["student_id"]
pivot_table[old_slot][affected_students] -= 1
pivot_table[new_slot][affected_students] += 1

Solution

  • Here is sample code, the idea is update the total pivot table by subtract the pivot table of old rows and add the pivot table of new rows.

    So every time you change the data, you call twice pivot_table(), and one add() and one sub():

    import numpy as np
    import pandas as pd
    
    ### create random data
    N = 1000
    a = np.random.randint(0, 100, N)
    b = np.random.randint(0, 30, N)
    c = np.random.randint(0, 10, N)
    
    df = pd.DataFrame({"a":a, "b":b, "c":c})
    
    ### calculate pivot sum
    res = df.pivot_table(values="c", index="a", columns="b", aggfunc="sum", fill_value=0)
    
    ### create random rows to change
    M = 100
    row_index = np.unique(np.random.randint(0, N, M))
    old_rows = df.iloc[row_index]
    M = old_rows.shape[0]
    new_rows = pd.DataFrame({"a":np.random.randint(0, 100, M), 
                             "b":np.random.randint(0, 30, M),
                             "c":np.random.randint(0, 10, M)})
    
    ### update pivot table
    sub_df = old_rows.pivot_table(values="c", index="a", columns="b", aggfunc="sum", fill_value=0)
    add_df = new_rows.pivot_table(values="c", index="a", columns="b", aggfunc="sum", fill_value=0)
    new_res = res.sub(sub_df, fill_value=0).add(add_df, fill_value=0)
    
    ### check result
    df.iloc[row_index] = new_rows.values
    res2 = df.pivot_table(values="c", index="a", columns="b", aggfunc="sum", fill_value=0)
    print(new_res.astype(int).equals(res2))