python nlp nltk

How to get bigrams/trigrams containing predefined unigram keywords from a document corpus / dataframe column


I have a dataframe with text in one of its columns.

I have a list of predefined keywords that I need for analysis, and I want to find the words associated with them (and later build a word cloud and count occurrences) to understand the topics/context associated with those keywords.

Use case:

df['text_column']

keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']

Let's say one of the rows of the text column has the text: 'coca cola is expanding its business in soft drinks and aerated water'.

Another entry is: 'lime soda is the best selling item in fast food stores'

My objective is to get bigrams/trigrams like:

'coca_cola','coca_cola_expanding', 'soft_drinks', 'aerated_water', 'business_soft_drinks', 'lime_soda', 'food_stores'

Kindly help me to do that [Python only]


Solution

  • First, you can optionally load nltk's stop word list and remove any stop words from the text (such as "is", "its", "in", and "and"). Alternatively, you can define your own stop word list, or extend nltk's list with additional words. Then you can use the nltk.bigrams() and nltk.trigrams() functions to get the bigrams and trigrams, and join each one with an underscore _, as you asked. Also, have a look at Collocations.

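    Edit: If you haven't already, you need to run the following once to download the stop word list. word_tokenize also needs the Punkt tokenizer data, which is downloaded the same way with nltk.download('punkt') (or 'punkt_tab' on recent NLTK releases).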

    nltk.download('stopwords')
    

    Code:

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    
    word_data = "coca cola is expanding its business in soft drinks and aerated water"
    #word_data = "lime soda is the best selling item in fast food stores"
    
    # load nltk's stop word list
    stop_words = list(stopwords.words('english'))
    # extend the stop words list
    #stop_words.extend(["best", "selling", "item", "fast"])
    
    # tokenize the string and remove stop words
    word_tokens = word_tokenize(word_data)
    clean_word_data = [w for w in word_tokens if w.lower() not in stop_words]
        
    # get bigrams
    bigrams_list = ["_".join(item) for item in nltk.bigrams(clean_word_data)]
    print(bigrams_list)
    
    # get trigrams 
    trigrams_list = ["_".join(item) for item in nltk.trigrams(clean_word_data)]
    print(trigrams_list)
    

    Update

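    Once you get the bigram and trigram lists, you can check for matches against your keyword list to keep only the relevant ones. Note that `k in s` is a substring test on the joined n-gram, so a keyword such as 'soft' would also match an n-gram containing 'software'.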

    keywordlist = ['coca' , 'food', 'soft', 'aerated', 'soda']
    
    def find_matches(n_grams_list):
        # keep every n-gram that contains a keyword as a substring, without duplicates
        matches = []
        for k in keywordlist:
            for s in n_grams_list:
                if k in s and s not in matches:
                    matches.append(s)
        return matches
    
    all_matching_bigrams = find_matches(bigrams_list) # find all matching bigrams
    all_matching_trigrams = find_matches(trigrams_list) # find all matching trigrams
    
    # join the two lists
    all_matches = all_matching_bigrams + all_matching_trigrams
    print(all_matches)
    

    Output:

    ['coca_cola', 'business_soft', 'soft_drinks', 'drinks_aerated', 'aerated_water', 'coca_cola_expanding', 'expanding_business_soft', 'business_soft_drinks', 'soft_drinks_aerated', 'drinks_aerated_water']
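
Since the question also asks about running this over a whole text column and counting occurrences, here is a minimal standard-library sketch of that step. It rests on several assumptions. STOP_WORDS is a tiny hand-written stand-in for nltk's list, plain whitespace splitting replaces word_tokenize, and the helper names ngrams and keyword_ngrams are made up for illustration. Unlike find_matches above, it compares whole tokens rather than substrings, which is a deliberate change.

```python
from collections import Counter

# Assumption: a tiny hand-written stop word set stands in for nltk's list.
STOP_WORDS = {"is", "its", "in", "and", "the"}
KEYWORDS = {"coca", "food", "soft", "aerated", "soda"}


def ngrams(tokens, n):
    """Return consecutive n-token tuples (hypothetical stand-in for nltk.bigrams/trigrams)."""
    return list(zip(*(tokens[i:] for i in range(n))))


def keyword_ngrams(text, keywords=KEYWORDS, stop_words=STOP_WORDS):
    """Return underscore-joined bigrams and trigrams that contain a keyword token."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    tokens = [t for t in text.lower().split() if t not in stop_words]
    found = []
    for n in (2, 3):
        for gram in ngrams(tokens, n):
            # Compare whole tokens, so "soft" does not match "software".
            if keywords.intersection(gram):
                found.append("_".join(gram))
    return found


# Stand-in for the rows of the text column.
rows = [
    "coca cola is expanding its business in soft drinks and aerated water",
    "lime soda is the best selling item in fast food stores",
]

# Count how often each keyword n-gram occurs across all rows.
counts = Counter()
for row in rows:
    counts.update(keyword_ngrams(row))

print(counts.most_common(5))
```

With pandas, the same function could presumably be applied per row with something like df['text_column'].apply(keyword_ngrams), where the column name is taken from the question. The resulting Counter can then feed a word cloud or a frequency table.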