pythonpandasdataframepairwise

Filtering cells in pandas dataframe with unique row/column index pairs


I have a NxN pandas dataframe with identical row and column names. Each element represents the pairwise distance between 2 objects with IDs identified by the row and column names (for instance in the 1st column, second row of the dataframe in the attached image, the distance between object 100 and object 1000 is 4407.97). However, half of these cells contain duplicate information because the distance between objects 1000 and 100 is listed in its own cell in the 1st row, 2nd column. Is there a computationally efficient way to retain only the cells with a unique row, column index pair?

Example dataframe Example dataframe

I've tried to solve this with a loop that saves each index/row name combo in both directions as it loops (saving them in a list called "measured_pairs") and only saves a value in a vector called "pair_dist" if its index pair hasn't been looped over yet, but this is taking a very, very long time.

measured_pairs = []
pair_dist = []
for i in range(min_dists.shape[0]):
    rowname = min_dists.iloc[i].name
    colnames = min_dists.iloc[i].index.tolist()
    for col in colnames:
        if str(rowname) + '_' + str(col) in measured_pairs:
            pass
        elif rowname == col:
            pass
        else:
            pair_dist.append(min_dists[col].loc[rowname])
            measured_pairs.append(str(rowname) + '_' + str(col))
            measured_pairs.append(str(col) + '_' + str(rowname))

Solution

  • Assuming your DataFrame is symmetric and you're alright with converting to numpy:

    import pandas as pd 
    import numpy as np
    x = np.ones((5,5))
    df = pd.DataFrame(x, 
                      columns = ["A", "B", "C", "D","E"], 
                      index = ["A", "B", "C", "D","E"])
    
    lower_triangular = np.tril(df.to_numpy())
    
    lower_df = pd.DataFrame(lower_triangular, index = df.index, columns = df.columns)
    

    Where what you're interested in is the upper (or lower) triangular matrix defined in df, and np.tril is an efficient way of getting that matrix.

    If your DataFrame is not symmetric, you'll need to first sort both the columns and the index identically, although based on your preview it looks like it's probably symmetric.