I've been carrying out some KNN classification analysis on a breast cancer dataset in python's sklearn module. I have the following code which attemps to find the optimal k for classification of a target variable.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
breast_cancer_data = load_breast_cancer()
training_data, validation_data, training_labels, validation_labels = train_test_split(breast_cancer_data.data, breast_cancer_data.target, test_size = 0.2, random_state = 40)
results = []
for k in range(1,101):
classifier = KNeighborsClassifier(n_neighbors = k)
classifier.fit(training_data, training_labels)
results.append(classifier.score(validation_data, validation_labels))
k_list = range(1,101)
plt.plot(k_list, results)
plt.ylim(0.85,0.99)
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.title("Breast Cancer Classifier Accuracy")
plt.show()
The code loops through 1 to 100 and generates 100 KNN models with 'k' set to incremental values in the range 1 to 100. The performance of each of those models is saved to a list and a plot is generated showing 'k' on the x-axis and model performance on the y-axis.
The problem I have is that when I change the random_state parameter when spliting the data into training and testing partitions this results in completely different plots indicating varying model performance for different 'k'values for different dataset partitions.
For me this makes it difficult to decide which 'k' is optimal as the algorithm performs differently for different 'k's using different random states. Surely this doesn't mean that, for this particular dataset, 'k' is arbitrary? Can anyone shed some light on this?
This is completely expected. When you do the train-test-split, you are effectively sampling from your original population. This means that when you fit a model, any statistic (such as a model parameter estimate, or a model score) will it self be a sample estimate taken from some distribution. What you really want is a confidence interval around this score and the easiest way to get that is to repeat the sampling and remeasure the score.
But you have to be very careful how you do this. Here are some robust options:
The most common solution to this problem is to use k-fold cross-validation. In order not to confuse this k with the k from knn I'm going to use a capital for cross-validation (but bear in mind this is not normal nomenclature) This is a scheme to do the suggestion above but without a target leak. Instead of creating many splits at random, you split the data into K parts (called folds). You then train K models each time on K-1 folds of the data leaving aside a different fold as your test set each time. Now each model is independent and without a target leak. It turns out that the mean of whatever success score you use from these K models on their K separate test sets is a good estimate for the performance of training a model with those hyperparameters on the whole set. So now you should get a more stable score for each of your different values of k (small k for knn) and you can choose a final k this way.
Some extra notes:
GridSearchCV
Pipeline
s. This is very important, make sure you understand it.stratify
argument on train_test_split
.Note the CV is the industry standard and that's what you should do, but there are other options:
You can read about this in detail in introduction to statistical learning section 5.2 (pg 187) with examples in section 5.3.4.
The idea is to take you training set and draw a random sample from it with replacement. This means you end up with some repeated records. You take this new training set, train and model and then score it on the records that didn't make it into the bootstrapped sample (often called out-of-bag samples). You repeat this process multiple times. You can now get a distribution of your score (e.g. accuracy) which you can use to choose your hyper-parameter rather than just the point estimate you were using before.
Jeremy Howard has a very interesting suggestion on how to calibrate your validation set to be a good representation of your test set. You only need to watch about 5 minutes from where that link starts. The idea is to split into three sets (which you should be doing anyway to choose a hyper parameter like k), train a bunch of very different but simple quick models on your train set and then score them on both your validation and test set. It is OK to use the test set here because these aren't real models that will influence your final model. Then plot the validation scores vs the test scores. They should fall roughly on a straight line (the y=x line). If they do, this means the validation set and test set are both either good or bad, i.e. performance in the validation set is representative of performance in the test set. If they don't fall on this straight line, it means the model scores you get from you validation set are not indicative of the score you'll get on unseen data and thus you can't use that split to train a sensible model.
This is obviously not very practical for your situation but I thought I'd mention it for completeness. As your sample size increases, your standard error drops (i.e. you can get tighter bounds on your confidence intervals). But you'll need more training and more test data. While you might not have access to that here, it's worth keeping in mind for real world situations where you can assess the trade-off of the cost of gathering new data vs the desired accuracy in assessing your model performance (and probably the performance itself too).