I am using a random forest as my regression model, where X is a high-dimensional data set in simplex form. I have tried a permutation test with the following code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
X = pd.read_csv("X_p.csv", delimiter=",", engine="c", index_col=0, low_memory=False)
Y = pd.read_csv("Y_p.csv", delimiter=",", engine="c", low_memory=False)
X = X.iloc[:, 1:]
Y = Y.iloc[:, 1:]
Y = Y['0'].values.ravel()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
rf_model = RandomForestRegressor(n_estimators=100, min_samples_split=5, max_depth=50, random_state=42)
original_mse = mean_squared_error(Y_train, rf_model.fit(X_train, Y_train).predict(X_train))
permuted_mse = []
for _ in range(10):  # number of permutations
    permuted_Y = np.random.permutation(Y_train)
    permuted_mse.append(mean_squared_error(permuted_Y, rf_model.fit(X_train, permuted_Y).predict(X_train)))
permuted_mse = np.asarray(permuted_mse)  # so the elementwise comparison below works
# Step 6: Significance testing based on MSE
p_value_mse = np.sum(permuted_mse > original_mse) / (len(permuted_mse) + 1)
# Print results
print("Original MSE:", original_mse)
print("Permutation test p-value based on MSE:", p_value_mse)
output:
Original MSE: 2.2922379491349436
Permutation test p-value based on MSE: 0.9090909090909091
Process finished with exit code 0
Even with just 10 permutations the code takes quite a long time, and it always returns a very high p-value (basically all permuted MSEs are larger than the original one). What is my mistake? I have tried modifying my random forest parameters, but nothing seems to help.
I guess the idea behind using MSE is to check that our original MSE is lower than it would be on a randomly permuted version of the data.
That highlights the problem with the implementation of the permutation test: your alternative hypothesis is that the observed MSE is lower than if Y were independent of X. In other words, a lower MSE is "more extreme" than one would expect to get by chance under the null hypothesis. So the last step should be:
p_value_mse = (np.sum(permuted_mse <= original_mse) + 1) / (len(permuted_mse)+1)
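As a quick sanity check, here is the corrected formula applied to made-up numbers (not from your run):

```python
import numpy as np

# Hypothetical null distribution of MSEs from fits on permuted labels
permuted_mse = np.array([3.1, 2.9, 3.4, 2.2, 3.0])
original_mse = 2.3  # made-up observed MSE

# One permuted MSE (2.2) is <= the observed one, so p = (1 + 1) / (5 + 1)
p_value_mse = (np.sum(permuted_mse <= original_mse) + 1) / (len(permuted_mse) + 1)
print(p_value_mse)  # 0.3333...
```

A small p-value here would indicate that the observed MSE is unusually low compared with the null distribution, which is the direction of evidence you actually care about.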
Compared with the original computation, this makes three changes:

- < instead of >, to correspond with the alternative hypothesis being tested.
- <= instead of <, because elements of the null distribution equal to your observed statistic would also be considered "extreme".
- +1 in the numerator as well as the denominator (reference). These +1s are commonly interpreted as including the original observation in the permutation distribution, which gives an intuitive reason for including them in both the numerator and denominator. See [1] for more information, though: there is a case to be made for this being an approximation to a (long-run) "exact" p-value (despite the use of randomization).

You might also read the SciPy tutorial about Resampling and Monte Carlo Methods if you're unfamiliar with the topic of permutation tests.
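For reference, scipy.stats.permutation_test can run this kind of test for you, including the +1 correction and the choice of alternative. A minimal sketch on synthetic data (assuming SciPy >= 1.9; the variable names mirror the question, but the data here is made up so the snippet runs on its own):

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X_train = rng.random((150, 4))                             # synthetic features
Y_train = X_train.sum(axis=1) + 0.1 * rng.standard_normal(150)  # Y depends on X

rf = RandomForestRegressor(n_estimators=30, min_samples_split=5, random_state=42)

def statistic(y):
    # Training-set MSE of a forest fit on (X_train, y); lower = more extreme
    return mean_squared_error(y, rf.fit(X_train, y).predict(X_train))

res = stats.permutation_test(
    (Y_train,), statistic,
    permutation_type='pairings',  # permute Y relative to X
    alternative='less',           # low MSE is evidence against the null
    n_resamples=19, random_state=0)
print(res.statistic, res.pvalue)
```

Note that permutation_type='pairings' with alternative='less' encodes exactly the hypothesis discussed above: the observed MSE should be lower than MSEs obtained when the pairing between X and Y is broken.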