I'm interested in using logistic regression to classify opera singing (n=100 audio files) from non-opera singing (n=300 audio files) (just an example). I have multiple features that I can use (e.g. MFCC, pitch, signal energy). I would like to use PCA to reduce dimensionality, which will drop the 'least important variables'. My question is: should I do my PCA on my whole dataset (both opera and non-opera)? Because if I do, wouldn't this drop the 'least important variables' for both opera and non-opera, rather than drop the variables least important for identifying opera?
You must do your PCA on the whole dataset.
PCA does not remove the 'least important variables'. PCA is a dimensionality reduction algorithm that finds linear combinations of the input features that encode as much of the information (inertia) as possible using fewer coordinates.
So if your data has N_Feats features, you can think of PCA as a matrix of dimension N_Feats x Projection_size, where Projection_size < N_Feats, that you multiply with your data to get a projection of lower dimension. This implies that you need all your features (variables) to compute your projection.
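As a quick illustration (a minimal sketch with a made-up feature matrix `X`; the shapes matter here, not the data), scikit-learn's `PCA` makes this matrix picture explicit:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 400 audio files x 20 features (MFCCs, pitch, energy, ...)
X = rng.normal(size=(400, 20))

proj_size = 5                              # Projection_size < N_Feats
pca = PCA(n_components=proj_size).fit(X)   # fit on the WHOLE dataset, both classes

# The learned projection is just a matrix of shape (Projection_size, N_Feats)
W = pca.components_                        # here: (5, 20)

# Transforming = centering the data, then multiplying by that matrix
X_proj_manual  = (X - pca.mean_) @ W.T     # shape (400, 5)
X_proj_sklearn = pca.transform(X)          # same result
assert np.allclose(X_proj_manual, X_proj_sklearn)
```

Note that every one of the 20 original features enters the multiplication, which is why you cannot "drop" variables per class this way.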
If you think in terms of projections, it doesn't make sense to have 2 different projections, one for each class. Why? There are 2 reasons: