I'm doing text classification and will be dealing with words that are not captured in my training data, meaning the word should be treated as unknown.
Does anyone know if scikit's cross validation will treat a particular word as unseen if it does not exist in the training data?
Or will scikit treat all words as features even if they are not in the training set?
If you run the cross validation on a pipeline that wraps both the feature extractor (e.g. CountVectorizer or TfidfVectorizer) and the classifier, everything works out of the box: the whole pipeline, vectorizer included, is refit on the training portion of each fold, so features that occur only in the test set are simply ignored (not mapped to a dimension in the vector representation).
There are more details about how the vocabulary_ attribute is used to map feature names to dimensions in the documentation on text feature extraction.
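To make the vocabulary mapping concrete, here is a minimal sketch. The toy documents and the variable names (train_docs, test_docs, X_test) are made up for illustration; only CountVectorizer and its vocabulary_ attribute come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: "meowed" never appears in the training documents.
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat meowed"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)  # the vocabulary is learned from the training docs only

# Maps each training token to a column index.
vocabulary = vectorizer.vocabulary_

# transform() only counts tokens already in the vocabulary;
# "meowed" has no column, so it is silently dropped.
X_test = vectorizer.transform(test_docs)

print(sorted(vocabulary))   # tokens seen during fit
print(X_test.shape)         # one row, one column per training token
print(X_test.toarray())
```

The test document ends up with the same number of columns as the training vocabulary, and only "the" and "cat" contribute counts.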
There is also an example that shows how to cross validate a pipeline that comprises a feature extraction component and a classifier.
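As a rough sketch of that setup (not the linked example itself), the following cross validates a vectorizer and a classifier together. The documents, labels, choice of LogisticRegression and cv=2 are assumptions picked to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with two classes (1 = positive, 0 = negative).
docs = [
    "great movie loved it",
    "wonderful acting and story",
    "really enjoyable film",
    "fantastic and moving",
    "terrible movie hated it",
    "boring plot and bad acting",
    "awful waste of time",
    "dull and disappointing",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is part of the estimator, so each fold refits it
# on that fold's training documents only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Words that appear only in a fold's held-out documents are ignored at
# transform time instead of leaking into the feature space.
scores = cross_val_score(pipeline, docs, labels, cv=2)
print(scores)
```

Vectorizing the whole corpus first and then cross validating only the classifier would instead build the vocabulary from the test documents as well, which is the situation the question wants to avoid.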
Edit: fixed train / test typo
Edit 2: fixed broken link to example.