python machine-learning scikit-learn similarity mse

Measuring similarity using RMSE


I have the following data:

object l2a l2b l4 l5
a 0.6649 0.5916 0.033569 0.557373
b 0.8421 0.5132 0.000000 0.697193
c 0.6140 0.2807 0.084217 0.650313
d 0.7619 0.3810 0.000000 0.662306
e 0.6957 0.3043 0.000000 0.645135

Is it possible to measure the similarity between each pair of objects, (a-b), (a-c), (a-d), (a-e), (b-c), ..., (d-e), using RMSE?

For example:

Similarity between object a (_a) and object b (_b):

diff_l2a = l2a_a - l2a_b

diff_l2b = l2b_a - l2b_b

diff_l4 = l4_a - l4_b

diff_l5 = l5_a - l5_b

Then calculate the RMSE:

RMSEs = [RMSE(diff_l2a, diff_l2b), RMSE(diff_l2a, diff_l4), RMSE(diff_l2a, diff_l5), ..., RMSE(diff_l4, diff_l5)]

Similarity:

average(RMSEs)

Solution

  • Core of the pairwise RMSE computation (RMSE is a distance, so a lower value means the two objects are more similar):

    num_objects = len(df)
    sim_matrix = np.zeros((num_objects, num_objects))
    
    for i in range(num_objects):
        for j in range(i + 1, num_objects):
            rmse = np.sqrt(mean_squared_error(attributes[i], attributes[j]))
            sim_matrix[i, j] = rmse
            sim_matrix[j, i] = rmse
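
To make concrete what each loop iteration computes, here is a minimal sketch for the single pair (a, b), using only NumPy and the values from the question. The variable names `a`, `b`, `diff` and `rmse_ab` are illustrative and do not appear in the original code.

```python
import numpy as np

# Attribute vectors (l2a, l2b, l4, l5) for objects a and b, copied from the question.
a = np.array([0.6649, 0.5916, 0.033569, 0.557373])
b = np.array([0.8421, 0.5132, 0.000000, 0.697193])

# Per-attribute differences: diff_l2a, diff_l2b, diff_l4, diff_l5.
diff = a - b

# RMSE = square root of the mean of the squared differences.
rmse_ab = np.sqrt(np.mean(diff ** 2))

print(f"RMSE(a, b) = {rmse_ab:.4f}")
```

This value should match `sim_matrix[0, 1]` from the loop, since `mean_squared_error` on two 1-D arrays computes the same mean of squared differences.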
    

    Full code (with a DataFrame):

    import pandas as pd
    import numpy as np
    from sklearn.metrics import mean_squared_error
    
    data = {
        'object': ['a', 'b', 'c', 'd', 'e'],
        'l2a': [0.6649, 0.8421, 0.6140, 0.7619, 0.6957],
        'l2b': [0.5916, 0.5132, 0.2807, 0.3810, 0.3043],
        'l4': [0.033569, 0.0, 0.084217, 0.0, 0.0],
        'l5': [0.557373, 0.697193, 0.650313, 0.662306, 0.645135]
    }
    df = pd.DataFrame(data)
    attributes = df.iloc[:, 1:].values
    
    num_objects = len(df)
    sim_matrix = np.zeros((num_objects, num_objects))
    
    for i in range(num_objects):
        for j in range(i + 1, num_objects):
            rmse = np.sqrt(mean_squared_error(attributes[i], attributes[j]))
            sim_matrix[i, j] = rmse
            sim_matrix[j, i] = rmse
    
    sim_df = pd.DataFrame(sim_matrix, columns=df['object'], index=df['object'])
    
    print("Similarity Matrix:")
    print(sim_df)
    
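    # Average over the upper triangle so that each pair is counted once and
    # pairs of genuinely identical objects (RMSE of 0.0) are not dropped.
    upper = np.triu_indices(num_objects, k=1)
    average_sim = sim_matrix[upper].mean()
    print(f"Average RMSE over all pairs: {average_sim:.3f}")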
    


    Addition:

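    The same pairwise RMSE matrix can be computed without explicit loops. The RMSE between two rows equals their Euclidean distance divided by the square root of the number of attributes:

    from scipy.spatial.distance import pdist, squareform
    sim_matrix = squareform(pdist(attributes, 'euclidean')) / np.sqrt(attributes.shape[1])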
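
As a sanity check, the sketch below compares a loop-based RMSE matrix with a `pdist`-based one, relying on the identity RMSE = Euclidean distance / sqrt(number of attributes). The names `loop_matrix`, `vectorized` and `matches` are illustrative, and the data is copied from the question.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows are objects a..e; columns are l2a, l2b, l4, l5 (values from the question).
attributes = np.array([
    [0.6649, 0.5916, 0.033569, 0.557373],
    [0.8421, 0.5132, 0.000000, 0.697193],
    [0.6140, 0.2807, 0.084217, 0.650313],
    [0.7619, 0.3810, 0.000000, 0.662306],
    [0.6957, 0.3043, 0.000000, 0.645135],
])
n_objects, n_features = attributes.shape

# Reference: explicit loop over every pair of objects.
loop_matrix = np.zeros((n_objects, n_objects))
for i in range(n_objects):
    for j in range(i + 1, n_objects):
        rmse = np.sqrt(np.mean((attributes[i] - attributes[j]) ** 2))
        loop_matrix[i, j] = loop_matrix[j, i] = rmse

# Loop-free: Euclidean distance scaled by 1 / sqrt(number of attributes).
vectorized = squareform(pdist(attributes, "euclidean")) / np.sqrt(n_features)

# True when both approaches agree up to floating-point tolerance.
matches = np.allclose(loop_matrix, vectorized)
print(matches)
```

Taking the square root of the Euclidean distance itself would not give the RMSE, which is why the scaling by `np.sqrt(n_features)` is used here.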
    

    See also: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html