machine-learning nlp stanford-nlp categorization

Which Stanford NLP package should I use for content categorization?


I have about 5000 terms in a table and I want to group them into categories that make sense.

For example, some terms are:

Nissan
Ford
Arrested
Jeep
Court

The result should be that Nissan, Ford, and Jeep are grouped into one category and that Arrested and Court end up in another. I looked at the Stanford Classifier. Am I right to assume that it is the right tool for this?


Solution

  • I would suggest using NLTK, provided the list doesn't contain many proper nouns, since WordNet has limited coverage of those. You can use the semantic similarity from WordNet as features and try to cluster the words. Here's a discussion about how to do that.

    To use the Stanford Classifier, you need to know in advance how many buckets (classes) of words you want. Besides, I think it is meant for classifying documents rather than individual words.
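
As a rough illustration of the WordNet idea, here is a minimal sketch, not a definitive implementation. It assumes NLTK is installed and the WordNet corpus has been downloaded with `nltk.download("wordnet")`. The function names `wordnet_similarity` and `cluster_terms`, the choice of Wu-Palmer similarity, the single-link grouping, and the `0.8` threshold are all assumptions made for the example; they do not come from the linked discussion.

```python
from itertools import combinations


def wordnet_similarity(a, b):
    """Best Wu-Palmer similarity over all synset pairs of a and b (0.0 if none)."""
    # Imported lazily so cluster_terms() can be used without NLTK installed.
    from nltk.corpus import wordnet as wn

    best = 0.0
    for s1 in wn.synsets(a.lower()):
        for s2 in wn.synsets(b.lower()):
            score = s1.wup_similarity(s2)
            # Some synset pairs have no common ancestor and yield None.
            if score is not None and score > best:
                best = score
    return best


def cluster_terms(terms, similarity, threshold=0.8):
    """Single-link grouping: terms joined by a chain of pairs scoring >= threshold."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the group representative, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())
```

Calling `cluster_terms(terms, wordnet_similarity)` returns a list of groups. A term with no WordNet entry never scores above 0.0, so it stays in a group of its own, which is why a list with many brand names or other proper nouns is a poor fit. The loop compares every pair, so about 5000 terms means roughly 12.5 million comparisons; that is slow but workable as a one-off job, and the threshold will need tuning by inspecting the output.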