Tags: python, machine-learning, correlation, unsupervised-learning, feature-engineering

Is correlation an important factor in unsupervised learning (clustering)?


I am working with a dataset of size (500, 33).

In particular, the dataset contains 9 features, say

[X_High, X_medium, X_low, Y_High, Y_medium, Y_low, Z_High, Z_medium, Z_low]

Both visually and from the correlation matrix, I observed that

the groups [X_High, Y_High, Z_High], [X_medium, Y_medium, Z_medium], and [X_low, Y_low, Z_low] are each internally highly correlated (above 0.85).
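
For reference, a minimal sketch of that correlation check, assuming the nine features sit in a pandas DataFrame named `df` (the name and the 0.85 threshold are placeholders):

```python
import numpy as np
import pandas as pd  # `df` is assumed to already hold the nine feature columns

corr = df.corr().abs()  # absolute pairwise Pearson correlations

# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Feature pairs whose correlation exceeds the 0.85 threshold
high_pairs = upper.stack().loc[lambda s: s > 0.85].sort_values(ascending=False)
print(high_pairs)
```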

I would like to apply a clustering algorithm (say k-means, GMM, or DBSCAN).

In that case,

Is it necessary to remove the correlated features before unsupervised learning? Does removing or transforming the correlated features have any impact on the clustering results?


Solution

  • My assumption here is that you're asking this because, in linear modeling, highly collinear variables are known to cause issues.

    The short answer is no: you don't need to remove highly correlated variables before clustering out of collinearity concerns. Clustering algorithms don't estimate coefficients under linear-independence assumptions the way regression does, so collinearity won't destabilize them; a quick illustration follows.
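
    As an illustrative check (synthetic data; all parameters below are arbitrary), k-means fits without complaint even when every feature is duplicated, i.e. pairwise correlations of exactly 1.0:

    ```python
    # Hypothetical demo: cluster blobs whose features are perfectly collinear.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=3, n_features=3, random_state=0)
    X_collinear = np.hstack([X, X])  # duplicate every column -> correlation 1.0

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_collinear)
    print(np.bincount(labels))  # cluster sizes; k-means converges normally
    ```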

    That doesn't mean that keeping a bunch of highly correlated variables is a good thing: the features are redundant, so you're carrying more data than you need to recover the same structure. At your data size (500 rows, a handful of features) that's probably not an issue, but on larger data you could fold the correlated variables into fewer dimensions via PCA or another dimensionality-reduction method to cut the computational overhead, as in the sketch below.
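
    A hedged sketch of that PCA-then-cluster idea, assuming a (500, 9) feature matrix `X` holding the nine columns (the variance threshold and cluster count are illustrative):

    ```python
    # Hypothetical pipeline: standardize, compress with PCA, then cluster.
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    pipeline = make_pipeline(
        StandardScaler(),        # put features on a comparable scale first
        PCA(n_components=0.95),  # keep enough components for 95% of the variance
        KMeans(n_clusters=3, n_init=10, random_state=0),
    )
    labels = pipeline.fit_predict(X)  # X is assumed: your (500, 9) feature matrix
    ```

    With three tightly correlated groups, PCA would likely compress the nine columns down to only a few components here, so the clustering step sees far less redundant input.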