python-3.x pandas nlp nltk trigram

Ngrams from pandas column


I have a pandas DataFrame with the following columns:

Column 1

['if', 'you', 'think', 'she', "'s", 'cute', 'now', ',', 'you', 'should', 'have', 'see', 'her', 'a', 'couple', 'of', 'year', 'ago', '.']
['uh', ',', 'yeah', '.', 'just', 'a', 'fax', '.']

Column 2

if you think she 's cute now , you should have see her a couple of year ago .
uh , yeah . just a fax .

etc.

My goal is to count the bigrams, trigrams, and four-grams in the DataFrame, specifically in Column 2, which is already lemmatized.

I tried the following:

import nltk
from nltk import bigrams
from nltk import trigrams

trig = trigrams(df["Column2"])
print(trig)

However, instead of the trigrams, this prints a generator object:

<generator object trigrams at 0x0000013C757F1C48>

My final goal is to print the top X bigrams, trigrams, etc.


Solution

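  • trigrams returns a lazy generator, and passing the column directly treats each whole row as a single token. Instead, split each row into tokens and use a list comprehension to flatten the trigrams of all rows into one list: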

    import pandas as pd
    from nltk import trigrams
    
    df = pd.DataFrame({'Column2': ["if you think she cute now you if uh yeah just",
                                   'you think she uh yeah just a fax']})
    
    # split each row into tokens, then flatten the trigrams of all rows into one list
    L = [t for s in df['Column2'] for t in trigrams(s.split())]
    print(L)
    [('if', 'you', 'think'), ('you', 'think', 'she'), ('think', 'she', 'cute'), 
     ('she', 'cute', 'now'), ('cute', 'now', 'you'), ('now', 'you', 'if'), 
     ('you', 'if', 'uh'), ('if', 'uh', 'yeah'), ('uh', 'yeah', 'just'), 
     ('you', 'think', 'she'), ('think', 'she', 'uh'), ('she', 'uh', 'yeah'),
     ('uh', 'yeah', 'just'), ('yeah', 'just', 'a'), ('just', 'a', 'fax')]
    

    Then count the tuples with collections.Counter:

    from collections import Counter
    c = Counter(L)
    print(c)
    Counter({('you', 'think', 'she'): 2, ('uh', 'yeah', 'just'): 2, ('if', 'you', 'think'): 1,
             ('think', 'she', 'cute'): 1, ('she', 'cute', 'now'): 1, ('cute', 'now', 'you'): 1,
             ('now', 'you', 'if'): 1, ('you', 'if', 'uh'): 1, ('if', 'uh', 'yeah'): 1, 
             ('think', 'she', 'uh'): 1, ('she', 'uh', 'yeah'): 1, 
             ('yeah', 'just', 'a'): 1, ('just', 'a', 'fax'): 1})
    

    For the top values, use collections.Counter.most_common:

    top = c.most_common(5)
    print(top)
    [(('you', 'think', 'she'), 2), (('uh', 'yeah', 'just'), 2), 
     (('if', 'you', 'think'), 1), (('think', 'she', 'cute'), 1),
     (('she', 'cute', 'now'), 1)]
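
Since the question also asks for bigrams and four-grams, the same split, flatten, and count steps can be wrapped in a function that takes n as a parameter. The following is only a sketch under some assumptions: it uses the standard library instead of nltk, it takes the column as a plain list of strings (for example from df['Column2'].tolist()), and the helper names ngrams_from_tokens and top_ngrams are made up for illustration.

```python
from collections import Counter


def ngrams_from_tokens(tokens, n):
    """Return all n-grams over a list of tokens as a list of tuples."""
    # zip over n shifted views of the token list; zip stops at the shortest
    # view, so inputs shorter than n tokens yield an empty list
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(sentences, n, k):
    """Count n-grams across all sentences and return the k most common."""
    counts = Counter()
    for sentence in sentences:
        # whitespace split, matching the x.split() step in the answer
        counts.update(ngrams_from_tokens(sentence.split(), n))
    return counts.most_common(k)


# same sample data as in the answer, as a plain list of strings
sentences = ["if you think she cute now you if uh yeah just",
             "you think she uh yeah just a fax"]

for n in (2, 3, 4):
    print(n, top_ngrams(sentences, n, 3))
```

Counting per sentence keeps n-grams from spanning two rows, which would happen if all rows were joined into one long string first.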