I want to clear up one question that bothers me.
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# ... rest of the preparation ...

mean_sqrd_error = cross_val_score(rfr, x, y, scoring='neg_mean_squared_error')
sqrd = mean_squared_error(y_test, y_pred)
Are these two doing the same thing, with cross_val_score just doing model.predict() by itself? How does sklearn do the maths when using cross_val_score if I can't give it y_pred as an argument? Or can I?
mse = cross_val_score(model, x, y, cv=3, scoring=...)
cross_val_score() partitions the supplied data into a train fold and a validation fold. It trains model on the train fold, and scores it on the validation fold. This is repeated cv=3 times, so you get 3 scores. You can average those scores to get a final number.
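For instance, a minimal self-contained sketch (make_regression and the sizes here are just stand-ins for your own data and model):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

x, y = make_regression(n_samples=90, n_features=5, random_state=0)  # toy data
rfr = RandomForestRegressor(random_state=0)

scores = cross_val_score(rfr, x, y, cv=3, scoring='neg_mean_squared_error')
print(scores)         # three scores, one per fold
print(scores.mean())  # averaged into a single number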
For a dataset comprising 9 samples, the mechanics of cross-validation are:

- Train model on samples [4,5,6,7,8,9], predict and score on samples 1-3
- Train model on samples [1,2,3,7,8,9], predict and score on samples 4-6
- Train model on samples [1,2,3,4,5,6], predict and score on samples 7-9

The indices could be shuffled in advance. By training on some samples, and scoring on the 'unseen' (out-of-fold) samples, the scores give you a better measure of how the model performs on unseen data.
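You can ask KFold for those splits directly; a small sketch (note that sklearn indexes samples from 0, and the default KFold does not shuffle):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(9).reshape(-1, 1)  # nine toy samples
for train_idx, val_idx in KFold(n_splits=3).split(X):
    print("train:", train_idx, "validate:", val_idx)
# train: [3 4 5 6 7 8] validate: [0 1 2]
# train: [0 1 2 6 7 8] validate: [3 4 5]
# train: [0 1 2 3 4 5] validate: [6 7 8]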
If you instead train the model on the entire dataset, and then score it on the same samples it was trained on, the scores would be biased and not a good measure of how the model handles new data.
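A quick way to convince yourself, on synthetic data (the sizes, noise level, and random_state below are arbitrary): the in-sample score of a random forest is far more flattering than its cross-validated score.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=120, n_features=5, noise=10, random_state=0)
model = RandomForestRegressor(random_state=0)

# scoring on the very samples the model was trained on: optimistically biased
train_mse = mean_squared_error(y, model.fit(X, y).predict(X))

# scoring on held-out folds: a fairer estimate (sign flipped back to plain MSE)
cv_mse = -cross_val_score(model, X, y, cv=3, scoring='neg_mean_squared_error').mean()

print(train_mse, cv_mse)  # train_mse comes out much lower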
You often want scores and predictions on samples the model has not been trained on, which is what the cross_val_* functions provide. cross_val_predict() gives you the predictions for each fold (out-of-fold predictions), whereas cross_val_score() gives you the scores for each fold (out-of-fold scores).
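A short sketch of the difference on toy data (make_regression is again just a placeholder): cross_val_predict returns one out-of-fold prediction per sample, while cross_val_score returns one score per fold.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = make_regression(n_samples=90, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0)

oof_pred = cross_val_predict(model, X, y, cv=3)
fold_scores = cross_val_score(model, X, y, cv=3, scoring='neg_mean_squared_error')

print(oof_pred.shape)     # (90,) one prediction per sample
print(fold_scores.shape)  # (3,)  one score per fold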
y_pred is not relevant to these functions. For each fold, they internally fit a 'clean' model, and then predict/score that fitted model on only the held-out samples.
mse = mean_squared_error(y_test, y_pred)
This is just a generic scoring function. It calculates a score simply based on the data you give it. y_pred could be from anywhere, CV or otherwise.
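For example, it happily scores any two equal-length arrays, with no model or CV involved:

from sklearn.metrics import mean_squared_error

print(mean_squared_error([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))  # 0.375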
If you wanted to use that function in your own CV loop, the code would be along the lines of:
from sklearn.base import clone
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

fold_scores = []
for train_indices, val_indices in KFold(n_splits=3).split(X, y):
    # fit a fresh clone on the training fold, score on the held-out fold
    fold_model = clone(model).fit(X[train_indices], y[train_indices])
    score = mean_squared_error(y[val_indices], fold_model.predict(X[val_indices]))
    fold_scores.append(score)
sklearn does something similar to this when you supply scoring="neg_mean_squared_error". Note the sign: the built-in scorer returns the negated MSE, so that a larger score always means a better model; negate it again to get the plain MSE back.
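Putting the two together on toy data, the manual loop above and cross_val_score should produce the same numbers (a sketch assuming the default unshuffled KFold; make_regression is just a stand-in for your data):

import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=90, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0)

manual_scores = []
for train_indices, val_indices in KFold(n_splits=3).split(X, y):
    fold_model = clone(model).fit(X[train_indices], y[train_indices])
    manual_scores.append(mean_squared_error(y[val_indices], fold_model.predict(X[val_indices])))

auto_scores = -cross_val_score(model, X, y, cv=3, scoring='neg_mean_squared_error')
print(np.allclose(manual_scores, auto_scores))  # True: same splits, same metric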