machine-learning, classification, knn

How to deal with columns in KNN?


I'm currently learning ML, starting with K-nearest-neighbours classification, and I'm wondering how to deal with all the features (columns) given to me. I have a single dataset of 10k rows, which I split roughly 80/20 for training and validation. I also have test data in a separate CSV (without y).

What bothers me is that I can only reach ~78% accuracy during training, and I'm wondering how to improve my results. Looking at the features, I have questions about these columns in particular:

- [plot] The first dataset contains 2 distinct groups, while the test data contains evenly distributed points
- [plot] The reverse situation from the first picture
- [plot] The same with other columns
- [plot] Something strange

Should I remove these columns, or should I try to transform them somehow so I can use them in my training data?

Also, at the moment I don't understand why my model works better with metric='manhattan' instead of euclidean, or how to choose the optimal K from the train/test data. I've read that you should use sqrt(N), where N is the number of test rows, but is that really the case?
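One common way to settle both questions empirically, rather than relying on the sqrt(N) rule of thumb, is to cross-validate over K and the metric together. This is a minimal sketch assuming scikit-learn; the synthetic `make_classification` data is a stand-in for the real dataset:

```python
# Hypothetical sketch: choose K and the distance metric together with
# cross-validated grid search. Synthetic data stands in for the real CSV.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # the 80/20 split from the question

param_grid = {
    "n_neighbors": list(range(1, 32, 2)),   # odd K avoids ties in binary problems
    "metric": ["euclidean", "manhattan"],   # let the data decide between the two
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # the K and metric that cross-validated best
print(search.score(X_test, y_test))  # accuracy on the held-out 20%
```

Whichever metric wins here is simply the one that matches the geometry of your features; there is no universal answer.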

Train/Test data


Solution

  • Actually, your question isn't entirely clear. But if you're wondering about selecting the parameter K, you can look into the Elbow or Silhouette methods. These methods compute the error, or the variance of distances, for each candidate K in (2, n), so you can pick the K that best represents your dataset.
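    For a KNN classifier specifically, an elbow-style search can be run directly on cross-validated error: compute the mean error for each K and pick the point where it stops improving. A minimal sketch assuming scikit-learn, with synthetic data in place of the real dataset:

    ```python
    # Elbow-style scan for KNN: mean cross-validated error per K.
    # Plot errors.values() against K to see the "elbow" visually.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

    errors = {}
    for k in range(1, 26):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        errors[k] = 1.0 - scores.mean()     # mean CV error for this K

    best_k = min(errors, key=errors.get)    # K with the lowest CV error
    print(best_k, errors[best_k])
    ```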

    Also, how did you evaluate your prediction accuracy as a percentage without y?

    If you're wondering about metrics, that's more complicated. Each metric has its advantages and disadvantages; in some cases, cosine works better than the others. It depends on how the features of your data are separated from each other.

    If you want to improve your model, you can try feature normalization (softmax, L1, L2, etc.), dimensionality reduction (PCA, etc.), or feature selection (dropping some features).
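    Scaling is especially important for KNN, since raw distances mix features with different ranges. A minimal sketch assuming scikit-learn, combining standardization and PCA in one pipeline before the classifier (synthetic data again stands in for the real dataset):

    ```python
    # Sketch: scale features, reduce dimensions with PCA, then fit KNN.
    # All steps live in one pipeline so CV applies them fold by fold.
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=1000, n_features=12,
                               n_informative=5, random_state=0)

    pipe = make_pipeline(
        StandardScaler(),             # put all features on a comparable scale
        PCA(n_components=5),          # keep the 5 strongest directions
        KNeighborsClassifier(n_neighbors=7),
    )
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(score)
    ```

    Putting the scaler and PCA inside the pipeline (rather than transforming the whole dataset up front) keeps the cross-validation honest: each fold's transform is fitted only on that fold's training portion.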