machine-learning nlp data-science feature-selection feature-detection

Which feature selection technique for NLP does this represent?


I have a dataset produced by running NLP over technical documents.

The dataset has 60,000 records and 30,000 features. Each feature is a word, and each value is the number of times that word appears in the record.

Here is a sample of the dataset:

RowID             Microsoft  Internet  PCI  Laptop  Google  AWS  iPhone  Chrome
1                         8         2    0       0       5    1       0       0
2                         0         1    0       1       1    4       1       0
3                         0         0    0       7       1    0       5       0
4                         1         0    0       1       6    7       5       0
5                         5         1    0       0       5    0       3       1
6                         1         5    0       8       0    1       0       0
-------------------------------------------------------------------------------
Total Appearance      9,470       821    5     107   4,605  719      25       8

Some words appear fewer than 10 times in the whole dataset.

The technique is to keep only the words/features whose total count across the dataset exceeds a certain threshold (say, 100).

What is this technique called, the one that uses only features whose total number of appearances exceeds a threshold?


Solution

  • This feature selection technique is simple enough that I don't believe it has a particular name beyond something intuitive: "low-frequency feature filtering", "k-occurrence feature filtering", or "top k-occurrence feature selection" in the machine learning sense, and "term-frequency filtering" or "rare word removal" in the Natural Language Processing (NLP) sense.

    If you'd like to use more sophisticated means of feature selection, I'd recommend looking into the various supervised and unsupervised methods available. Cai et al. [1] provide a comprehensive survey; if you can't access the article, this page by JavaTPoint covers some of the supervised methods. A quick web search for supervised/unsupervised feature selection also yields many good blog posts, most of which use the SciPy and scikit-learn (sklearn) Python libraries.

    References

    [1] Cai, J., Luo, J., Wang, S. and Yang, S., 2018. Feature selection in machine learning: A new perspective. Neurocomputing, 300, pp.70-79.
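
To make the frequency-threshold filter described in the answer concrete, here is a minimal sketch in plain Python using the six sample rows from the question. The function name `filter_rare_features` and the parameter `min_total` are made up for illustration, and the threshold of 10 is an assumption chosen only because the sample is tiny; the question suggests something like 100 for the full dataset.

```python
# Sample rows copied from the question; each column is a per-record word count.
features = ["Microsoft", "Internet", "PCI", "Laptop", "Google", "AWS", "iPhone", "Chrome"]
rows = [
    [8, 2, 0, 0, 5, 1, 0, 0],
    [0, 1, 0, 1, 1, 4, 1, 0],
    [0, 0, 0, 7, 1, 0, 5, 0],
    [1, 0, 0, 1, 6, 7, 5, 0],
    [5, 1, 0, 0, 5, 0, 3, 1],
    [1, 5, 0, 8, 0, 1, 0, 0],
]


def filter_rare_features(features, rows, min_total):
    """Keep only the columns whose total count over all rows exceeds min_total.

    Hypothetical helper for illustration; not taken from any library.
    """
    # zip(*rows) transposes the matrix so each tuple is one feature column.
    totals = [sum(column) for column in zip(*rows)]
    keep = [i for i, total in enumerate(totals) if total > min_total]
    kept_features = [features[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_features, kept_rows


# Assumed threshold of 10 for this six-row sample only.
kept_features, kept_rows = filter_rare_features(features, rows, min_total=10)
print(kept_features)
```

On the sample rows alone, "Internet", "PCI", and "Chrome" fall at or below the threshold and are dropped. For the full 60,000 x 30,000 matrix a vectorised library would be more practical; assuming the counts live in a pandas DataFrame `df` with RowID as the index, something like `df.loc[:, df.sum(axis=0) > 100]` should express the same filter.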