python, pandas, performance, vectorization

Vectorize processing of a log table to determine the latest availability


I have a log table which contains changes. A + sign means addition, a - sign means deletion.

import pandas as pd

history = pd.DataFrame({
    "First": 
        ["X","X", "Y", "Y", "X", "X", "Y", "Z"],
    "Last": 
         ["Y", "X", "Y", "Y", "X", "X", "Y", "A"],
    "Change":
        ["+", "+", "-", "+", "-", "+", "+", "-"],
    "Date": 
        ["2022-05-01", "2024-05-01", "2024-06-01", "2024-06-01",
         "2024-05-03", "2024-05-02", "2024-06-02", "2024-06-01"]
})
    
history = history.sort_values(by=["Date", "Change"])
# sort needed to process the entries chronologically;
# "+" sorts before "-", so same-date additions are applied before deletions

This produces

  First Last Change        Date
0     X    Y      +  2022-05-01
1     X    X      +  2024-05-01
5     X    X      +  2024-05-02
4     X    X      -  2024-05-03
3     Y    Y      +  2024-06-01
2     Y    Y      -  2024-06-01
7     Z    A      -  2024-06-01
6     Y    Y      +  2024-06-02

In the next step, I want to display only what is currently available.

I built this logic using iteration, which is very slow. Basically:

latest = {}
item_columns = [
    "First",
    "Last",
]  

for _, row in history.iterrows():
    key = tuple(row[column] for column in item_columns)
    if row["Change"] == "+":
        latest[key] = row
    elif row["Change"] == "-" and key in latest:
        del latest[key]

available = pd.DataFrame(latest.keys(), columns=item_columns)

This produces the available items:

  First Last
0     X    Y
1     Y    Y

The issue is that the loop is slow with big tables, e.g. 20 seconds for the example below:

latest = {}
item_columns = [
    "First",
    "Last",
]  

duplicated = pd.concat([history.iloc[[1]]] * 50000, ignore_index=True)
history = pd.concat([history, duplicated], ignore_index=True)
for _, row in history.iterrows():
    key = tuple(row[column] for column in item_columns)
    if row["Change"] == "+":
        latest[key] = row
    elif row["Change"] == "-" and key in latest:
        del latest[key]

available = pd.DataFrame(latest.keys(), columns=item_columns)

Any way to speed up?
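A partial speed-up I'm aware of is itertuples, which yields lightweight namedtuples instead of building a Series per row, but it is still a Python-level loop. A sketch of that variant on the sample data:

```python
import pandas as pd

history = pd.DataFrame({
    "First":  ["X", "X", "Y", "Y", "X", "X", "Y", "Z"],
    "Last":   ["Y", "X", "Y", "Y", "X", "X", "Y", "A"],
    "Change": ["+", "+", "-", "+", "-", "+", "+", "-"],
    "Date":   ["2022-05-01", "2024-05-01", "2024-06-01", "2024-06-01",
               "2024-05-03", "2024-05-02", "2024-06-02", "2024-06-01"],
}).sort_values(by=["Date", "Change"])

latest = {}
# attribute access on the namedtuples replaces the per-row Series
# construction that makes iterrows slow
for row in history.itertuples(index=False):
    key = (row.First, row.Last)
    if row.Change == "+":
        latest[key] = row
    elif key in latest:
        del latest[key]

available = pd.DataFrame(list(latest.keys()), columns=["First", "Last"])
```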


Solution

  • If you want to keep the last + row per combination: filter the rows with +, then drop_duplicates:

    out = (history.query('Change == "+"')
                  .drop_duplicates(subset=['First', 'Last'], keep='last')
          )
    

    Output:

      First Last Change        Date
    0     X    Y      +  2022-05-01
    5     X    X      +  2024-05-02
    6     Y    Y      +  2024-06-02
    

    If you want to keep the last row per combination, but only if it is a + (this matches the loop's output), reverse the operations:

    out = (history.drop_duplicates(subset=['First', 'Last'], keep='last')
                  .query('Change == "+"')
          )
    

    Output:

      First Last Change        Date
    0     X    Y      +  2022-05-01
    6     Y    Y      +  2024-06-02
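
    As a sanity check (my sketch, not part of the original answer): the second variant reproduces the loop's result on the sample data, while relying only on vectorized operations:

```python
import pandas as pd

history = pd.DataFrame({
    "First":  ["X", "X", "Y", "Y", "X", "X", "Y", "Z"],
    "Last":   ["Y", "X", "Y", "Y", "X", "X", "Y", "A"],
    "Change": ["+", "+", "-", "+", "-", "+", "+", "-"],
    "Date":   ["2022-05-01", "2024-05-01", "2024-06-01", "2024-06-01",
               "2024-05-03", "2024-05-02", "2024-06-02", "2024-06-01"],
}).sort_values(by=["Date", "Change"])

# vectorized: keep the last row per (First, Last) pair, then keep it
# only if that last event was an addition
available = (history.drop_duplicates(subset=["First", "Last"], keep="last")
                    .query('Change == "+"')
                    [["First", "Last"]]
                    .reset_index(drop=True))

# original loop, for comparison
latest = {}
for _, row in history.iterrows():
    key = (row["First"], row["Last"])
    if row["Change"] == "+":
        latest[key] = row
    elif key in latest:
        del latest[key]
expected = pd.DataFrame(list(latest.keys()), columns=["First", "Last"])

assert available.equals(expected)
```

Since drop_duplicates and query are implemented in vectorized code, the per-row Python overhead disappears, so this should scale to the 100k-row benchmark frame without trouble.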