Let's say I have the following datafarme df1
corresponding to user1
:
+-------------------+-------+--------+-------+-------+----------+----------------+
| Models | MAE | MSE | RMSE | MAPE | R² score | Runtime [ms] |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LinearRegression | 4.906 | 27.784 | 5.271 | 0.405 | -6.917 | 0:00:43.387145 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| Random Forest | 2.739 | 10.239 | 3.2 | 0.231 | -1.917 | 0:28:11.761681 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| XGBoost | 2.826 | 10.898 | 3.301 | 0.234 | -2.105 | 0:03:58.883474 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| MLPRegressor | 5.234 | 30.924 | 5.561 | 0.43 | -7.812 | 0:01:44.252276 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| SVR | 5.061 | 29.301 | 5.413 | 0.417 | -7.349 | 0:04:52.754769 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| CatBoostRegressor | 2.454 | 8.823 | 2.97 | 0.201 | -1.514 | 0:19:36.925169 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LGBMRegressor | 2.76 | 10.204 | 3.194 | 0.231 | -1.907 | 0:04:51.223103 |
+-------------------+-------+--------+-------+-------+----------+----------------+
+-------------------+----------------------------------------------------------------------------------------------------------+
| Rank | MAE |
+-------------------+----------------------------------------------------------------------------------------------------------+
| Top models(sorted)| ["CatBoostRegressor","RandomForest","LGBMRegressor", "XGBoost","LinearRegression","SVR","MLPRegressor"] |
+-------------------+----------------------------------------------------------------------------------------------------------+
I have following datafarme df2
corresponding to user2
:
+-------------------+-------+--------+-------+-------+----------+----------------+
| Models | MAE | MSE | RMSE | MAPE | R² score | Runtime [ms] |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LinearRegression | 4.575 | 24.809 | 4.981 | 0.377 | -6.079 | 0:00:45.055854 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| Random Forest | 2.345 | 8.065 | 2.84 | 0.199 | -1.301 | 0:10:55.468473 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| XGBoost | 2.129 | 7.217 | 2.686 | 0.179 | -1.059 | 0:01:01.575033 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| MLPRegressor | 4.414 | 23.477 | 4.845 | 0.363 | -5.699 | 0:00:31.231719 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| SVR | 4.353 | 22.826 | 4.778 | 0.357 | -5.513 | 0:02:12.258870 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| CatBoostRegressor | 2.281 | 7.671 | 2.77 | 0.189 | -1.189 | 0:08:16.526615 |
+-------------------+-------+--------+-------+-------+----------+----------------+
| LGBMRegressor | 2.511 | 9.18 | 3.03 | 0.212 | -1.619 | 0:15:25.084937 |
+-------------------+-------+--------+-------+-------+----------+----------------+
+-------------------+----------------------------------------------------------------------------------------------------------+
| Rank | MAE |
+-------------------+----------------------------------------------------------------------------------------------------------+
| Top models(sorted)| ["XGBoost","CatBoostRegressor","RandomForest","LGBMRegressor","LinearRegression","SVR","MLPRegressor"] |
+-------------------+----------------------------------------------------------------------------------------------------------+
Let's say I have more datafarmes df1000
corresponding to user1000
.
Problem statement: I want to count how often each ranking order occurs across all users (for a given metric). (And then, sort the ranking orders by their counts, and, additionally, compute the percentage of how often each particular ranking order occurs (based on the counts).)
I want to rank Models
result (sorted over a specific column (e.g. MAE
) iteratively and return the frequency of top models over all dfs (df1
till df1000
). so this is not something I can easily reach using the:
df["category"].value_counts()
we are interested in computing absolute\relative frequencies in final ranked table in expected output. so definitely I need to transform and add the list of sorted models' names that'd be a list of strings.
Possible transformation or aggregation stages from my understanding:
['model2','model7', 'model6', 'model5', 'model4', 'model3', 'model1' ]
Users
in the final transformed dataframe could also be useful (however I did not mention it in the following table in the expected output.)counts
and freq(%)
in final tableExpected output:
+-------------------+----------------------------------------------------------------------------------------------------------+--------+---------+
| Rank | MAE |counts |freq(%) |
+-------------------+----------------------------------------------------------------------------------------------------------+--------+---------+
| Top models(sorted)| ["CatBoostRegressor","RandomForest","LGBMRegressor", "XGBoost","LinearRegression","SVR","MLPRegressor"] | 70 | 65% |
| Top models(sorted)| ["XGBoost","CatBoostRegressor","RandomForest","LGBMRegressor","LinearRegression","SVR","MLPRegressor"] | 20 | 12% |
| Top models(sorted)| .... | .... | .... |
....
+-------------------+----------------------------------------------------------------------------------------------------------+--------+---------+
I also was thinking maybe I can use Natural Language Processing (NLP) methods called TF-IDF to handle this problem using:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer
Potentially related posts I have checked:
Does this work?
import pandas as pd
from collections import Counter
# I've left out the other metrics because they are irrelevant to your question, but you can add them
data1 = {
'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
'MAE': [4.906, 2.739, 2.826, 5.234, 5.061, 2.454, 2.76]
}
data2 = {
'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
'MAE': [4.575, 2.345, 2.129, 4.414, 4.353, 2.281, 2.511]
}
data3 = {
'Models': ['LinearRegression', 'Random Forest', 'XGBoost', 'MLPRegressor', 'SVR', 'CatBoostRegressor', 'LGBMRegressor'],
'MAE': [4.575, 2.345, 2.129, 4.414, 4.353, 2.281, 2.511]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
# add your 1000 data frames here if you want
dataframes = [df1, df2, df3]
concatenated_models_list = []
for df in dataframes:
df_sorted = df.sort_values(by='MAE', ascending=False)
# Concatenate the sorted-by-MAE model names into one string, so that
# you can group by it as a key later
concatenated_models = ','.join(df_sorted['Models'].tolist())
concatenated_models_list.append(concatenated_models)
# Union everything into a single dataframe
union_df = pd.DataFrame(concatenated_models_list, columns=['Top models(sorted)'])
grouped_df = union_df['Top models(sorted)'].value_counts().reset_index()
grouped_df.columns = ['Top models(sorted)', 'count']
grouped_df['freq(%)'] = (grouped_df['count'] / len(dataframes)) * 100
grouped_df
Output
Top models(sorted) count freq(%)
LinearRegression,MLPRegressor,SVR,LGBMRegresso... 2 66.666667
MLPRegressor,SVR,LinearRegression,XGBoost,LGBM... 1 33.333333