python nlp n-gram vocabulary countvectorizer

Get the most frequent n-grams for each class


I have a dataset of tweets, each labeled as hate (1) or non-hate (0). I vectorized the data as a bag of character n-grams of length 3 to 4 (sklearn's CountVectorizer), and I want to extract the most frequent n-grams for each class. The following code works, but it counts n-grams over the whole dataset instead of per class.

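from sklearn.feature_extraction.text import CountVectorizer

# X holds the raw tweet texts
bag_of_words = CountVectorizer(
    ngram_range=(3, 4),
    analyzer='char'
)

bag_of_words_mx = bag_of_words.fit_transform(X)

vocab = bag_of_words.vocabulary_  # maps each n-gram to its column index
count_values = bag_of_words_mx.toarray().sum(axis=0)  # total count per n-gram over all tweets

# output n-grams that occur more than once, in ascending order of count
for ng_count, ng_text in sorted([(count_values[i], k) for k, i in vocab.items()]):
    if ng_count > 1:
        print(ng_count, ng_text)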

Is there a way to sort the vocabulary by class?


Solution

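  • Select the rows of each class with a boolean mask before summing: bag_of_words_mx[y == 0] and bag_of_words_mx[y == 1], where y is a NumPy array containing your target variable, in the same row order as X. Summing each sub-matrix over axis 0 then gives the n-gram counts for that class alone.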
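
A minimal sketch of that masking approach, assuming y is a NumPy array aligned with the rows of X. The toy tweets, the helper name top_ngrams_for_class, and the cutoff k are made up for illustration and are not part of the original question. The sum is taken directly on the sparse matrix, which avoids the dense copy that toarray() would create.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the tweets and their labels (hypothetical data).
X = ["i hate you", "hate hate hate", "nice day today", "have a nice day"]
y = np.array([1, 1, 0, 0])

bag_of_words = CountVectorizer(ngram_range=(3, 4), analyzer="char")
bag_of_words_mx = bag_of_words.fit_transform(X)

# N-grams ordered by their column index in the count matrix.
vocab = bag_of_words.vocabulary_
ngrams = sorted(vocab, key=vocab.get)


def top_ngrams_for_class(label, k=5):
    """Return up to k (n-gram, count) pairs for one class, most frequent first."""
    class_rows = bag_of_words_mx[y == label]  # boolean mask keeps that class's rows
    counts = np.asarray(class_rows.sum(axis=0)).ravel()  # per-n-gram totals
    order = np.argsort(counts)[::-1][:k]  # column indices, highest count first
    return [(ngrams[i], int(counts[i])) for i in order if counts[i] > 0]


top_by_class = {label: top_ngrams_for_class(label) for label in (0, 1)}
for label, top in top_by_class.items():
    print(label, top)
```

If y is a pandas Series or a plain list, converting it first with np.asarray(y) keeps the mask a plain boolean array. Ties between equally frequent n-grams come out in no particular order.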