Tags: python, pandas, dataframe, frequency, tf-idf

What is the best practice for calculating the global frequency of a list of elements with an exact order, across multiple pandas dataframes in Python?


Let's say I have the following dataframe df1 corresponding to user1:

+-------------------+-------+--------+-------+-------+----------+----------------+
|      Models       |  MAE  |  MSE   | RMSE  | MAPE  | R² score |    Runtime     |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LinearRegression  | 4.906 | 27.784 | 5.271 | 0.405 |  -6.917  | 0:00:43.387145 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   Random Forest   | 2.739 | 10.239 |  3.2  | 0.231 |  -1.917  | 0:28:11.761681 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|      XGBoost      | 2.826 | 10.898 | 3.301 | 0.234 |  -2.105  | 0:03:58.883474 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   MLPRegressor    | 5.234 | 30.924 | 5.561 | 0.43  |  -7.812  | 0:01:44.252276 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|        SVR        | 5.061 | 29.301 | 5.413 | 0.417 |  -7.349  | 0:04:52.754769 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| CatBoostRegressor | 2.454 | 8.823  | 2.97  | 0.201 |  -1.514  | 0:19:36.925169 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   LGBMRegressor   | 2.76  | 10.204 | 3.194 | 0.231 |  -1.907  | 0:04:51.223103 |
+-------------------+-------+--------+-------+-------+----------+----------------+


+-------------------+----------------------------------------------------------------------------------------------------------+
|      Rank         |                                                          MAE                                             | 
+-------------------+----------------------------------------------------------------------------------------------------------+
| Top models(sorted)| ["CatBoostRegressor","RandomForest","LGBMRegressor", "XGBoost","LinearRegression","SVR","MLPRegressor"]  |  
+-------------------+----------------------------------------------------------------------------------------------------------+

I have the following dataframe df2 corresponding to user2:

+-------------------+-------+--------+-------+-------+----------+----------------+
|      Models       |  MAE  |  MSE   | RMSE  | MAPE  | R² score |    Runtime     |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LinearRegression  | 4.575 | 24.809 | 4.981 | 0.377 |  -6.079  | 0:00:45.055854 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   Random Forest   | 2.345 | 8.065  | 2.84  | 0.199 |  -1.301  | 0:10:55.468473 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|      XGBoost      | 2.129 | 7.217  | 2.686 | 0.179 |  -1.059  | 0:01:01.575033 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   MLPRegressor    | 4.414 | 23.477 | 4.845 | 0.363 |  -5.699  | 0:00:31.231719 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|        SVR        | 4.353 | 22.826 | 4.778 | 0.357 |  -5.513  | 0:02:12.258870 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| CatBoostRegressor | 2.281 | 7.671  | 2.77  | 0.189 |  -1.189  | 0:08:16.526615 |
+-------------------+-------+--------+-------+-------+----------+----------------+
|   LGBMRegressor   | 2.511 |  9.18  | 3.03  | 0.212 |  -1.619  | 0:15:25.084937 |
+-------------------+-------+--------+-------+-------+----------+----------------+


+-------------------+----------------------------------------------------------------------------------------------------------+
|      Rank         |                                                          MAE                                             | 
+-------------------+----------------------------------------------------------------------------------------------------------+
| Top models(sorted)| ["XGBoost","CatBoostRegressor","RandomForest","LGBMRegressor","LinearRegression","SVR","MLPRegressor"]  |  
+-------------------+----------------------------------------------------------------------------------------------------------+

Let's say I have more dataframes, up to df1000 corresponding to user1000.

Problem statement: I want to count how often each ranking order occurs across all users for a given metric, then sort the ranking orders by their counts and, additionally, compute the percentage of users that produced each particular ranking order (based on those counts).

I want to rank the models (sorted by a specific column, e.g. MAE) for each dataframe iteratively and return the frequency of each top-model ordering over all dataframes (df1 to df1000), so this is not something I can easily achieve using:

df["category"].value_counts()

We are interested in computing absolute/relative frequencies in the final ranked table shown in the expected output, so I definitely need to transform each dataframe and build the list of sorted model names, which would be a list of strings.
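
As a minimal sketch of what I mean (with made-up three-model rankings, not my real data): once each user's ranking is collapsed into a single hashable key such as a tuple, value_counts does become applicable:

import pandas as pd

# One tuple per user, holding that user's exact model ranking (best first)
rankings = pd.Series([
    ("CatBoostRegressor", "Random Forest", "LGBMRegressor"),
    ("XGBoost", "CatBoostRegressor", "Random Forest"),
    ("CatBoostRegressor", "Random Forest", "LGBMRegressor"),
])

counts = rankings.value_counts()                    # absolute frequencies
freq = rankings.value_counts(normalize=True) * 100  # relative frequencies in %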

Possible transformation or aggregation stages, from my understanding (a rough sketch follows this list):

  1. Take each df and create the list of sorted model names based on the desired column or metric: ['model2', 'model7', 'model6', 'model5', 'model4', 'model3', 'model1']
  2. Including the users' names in the final transformed dataframe could also be useful (however, I did not include them in the table in the expected output).
  3. Compute absolute/relative frequencies and return them as counts and freq(%) in the final table.
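
A rough sketch of these three stages, assuming the per-user dataframes are collected in a dict keyed by user name (the dict, the user labels, and the chosen metric are placeholders for illustration):

import pandas as pd

# Hypothetical container: {"user1": df1, "user2": df2, ..., "user1000": df1000}
user_dfs = {"user1": df1, "user2": df2}
metric = "MAE"  # lower is better for MAE

rows = []
for user, df in user_dfs.items():
    # Stage 1: sorted model names for this user (best model, i.e. lowest MAE, first)
    ranking = tuple(df.sort_values(metric)["Models"])
    # Stage 2: keep the user name next to the ranking
    rows.append({"user": user, "Top models(sorted)": ranking})

ranked = pd.DataFrame(rows)

# Stage 3: absolute and relative frequencies of each exact ranking
out = ranked["Top models(sorted)"].value_counts().rename("counts").to_frame()
out["freq(%)"] = out["counts"] / out["counts"].sum() * 100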

Expected output:

+-------------------+----------------------------------------------------------------------------------------------------------+--------+---------+
|      Rank         |                                                    MAE                                                  |counts  |freq(%)  |
+-------------------+----------------------------------------------------------------------------------------------------------+--------+---------+
| Top models(sorted)| ["CatBoostRegressor","RandomForest","LGBMRegressor", "XGBoost","LinearRegression","SVR","MLPRegressor"]   | 70     |   65%   |
| Top models(sorted)| ["XGBoost","CatBoostRegressor","RandomForest","LGBMRegressor","LinearRegression","SVR","MLPRegressor"]   | 20     |   12%   |
| Top models(sorted)| ...                                                                                                      | ...    | ...     |
+-------------------+----------------------------------------------------------------------------------------------------------+--------+---------+

I was also thinking that maybe I could use a Natural Language Processing (NLP) method called TF-IDF to handle this problem, using:

# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

Potentially related posts I have checked:


Solution

  • Does this work?

    import pandas as pd
    from collections import Counter
    
    # I've left out the other metrics because they are irrelevant to your question, but you can add them
    data1 = {
        'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
        'MAE': [4.906, 2.739, 2.826, 5.234, 5.061, 2.454, 2.76]
    }
    data2 = {
        'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
        'MAE': [4.575, 2.345, 2.129, 4.414, 4.353, 2.281, 2.511]
    }
    data3 = {
        'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
        'MAE': [4.575, 2.345, 2.129, 4.414, 4.353, 2.281, 2.511]
    }
    
    
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    df3 = pd.DataFrame(data3)
    
    # add your 1000 data frames here if you want
    dataframes = [df1, df2, df3]
    
    concatenated_models_list = []
    
    for df in dataframes:
        # Sort ascending so the best model (lowest MAE) comes first
        df_sorted = df.sort_values(by='MAE', ascending=True)
        # Concatenate the sorted-by-MAE model names into one string, so that
        # you can group by it as a key later
        concatenated_models = ','.join(df_sorted['Models'].tolist())
        concatenated_models_list.append(concatenated_models)
    
    # Union everything into a single dataframe
    union_df = pd.DataFrame(concatenated_models_list, columns=['Top models(sorted)'])
    
    grouped_df = union_df['Top models(sorted)'].value_counts().reset_index()
    grouped_df.columns = ['Top models(sorted)', 'count']
    
    grouped_df['freq(%)'] = (grouped_df['count'] / len(dataframes)) * 100
    
    grouped_df
    
    

    Output

    Top models(sorted)  count   freq(%)
    XGBoost,CatBoostRegressor,Random Forest,LGBMRe...   2   66.666667
    CatBoostRegressor,Random Forest,LGBMRegressor,...   1   33.333333
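
    As a possible follow-up (not part of the original answer), the same idea could be wrapped in a small helper so the frequency table can be computed for any metric column; rank_frequency is just an illustrative name, and ascending=False would be needed for metrics where higher is better (e.g. R² score):

    def rank_frequency(dataframes, metric='MAE', ascending=True):
        # Count how often each exact model ranking occurs across the dataframes.
        # ascending=True ranks lower values first (MAE, MSE, RMSE, MAPE);
        # use ascending=False for metrics where higher is better (R² score).
        keys = [
            ','.join(df.sort_values(by=metric, ascending=ascending)['Models'])
            for df in dataframes
        ]
        out = pd.Series(keys).value_counts().reset_index()
        out.columns = ['Top models(sorted)', 'counts']
        out['freq(%)'] = out['counts'] / len(dataframes) * 100
        return out

    rank_frequency([df1, df2, df3], metric='MAE')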