How to get Bigram/Trigram of word from prelisted unigram from a document corpus / dataframe column

I have a dataframe with text in one of its columns.

I have listed some predefined keywords which I need for analysis and words associated with it (and later make a wordcloud and counter of occurrences) to understand topics /context associated with such keywords.

Use case:

df.text_column()

keywordlist = [coca , food, soft, aerated, soda]

lets say one of the rows of the text column has text : ' coca cola is expanding its business in soft drinks and aerated water'.

another entry like : 'lime soda is the best selling item in fast food stores'

my objective is to get Bigram/trigram like:

'coca_cola','coca_cola_expanding', 'soft_drinks', 'aerated_water', 'business_soft_drinks', 'lime_soda', 'food_stores'

Kindly help me to do that [Python only]

Solution

First, you can optioanlly load the nltk's stop word list and remove any stop words from the text (such as "is", "its", "in", and "and"). Alternatively, you can define your own stop words list, as well as even extend the nltk's list with additional words. Following, you can use nltk.bigrams() and nltk.trigrams() methods to get bigrams and trigrams joined with an underscore _, as you asked. Also, have a look at Collocations.

Edit: If you haven't already, you need to include the following once in your code, in order to download the stop words list.

nltk.download('stopwords')

Code:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

word_data = "coca cola is expanding its business in soft drinks and aerated water"
#word_data = "lime soda is the best selling item in fast food stores"

# load nltk's stop word list
stop_words = list(stopwords.words('english'))
# extend the stop words list
#stop_words.extend(["best", "selling", "item", "fast"])

# tokenize the string and remove stop words
word_tokens = word_tokenize(word_data)
clean_word_data = [w for w in word_tokens if not w.lower() in stop_words]
    
# get bigrams
bigrams_list = ["_".join(item) for item in nltk.bigrams(clean_word_data)]
print(bigrams_list)

# get trigrams 
trigrams_list = ["_".join(item) for item in nltk.trigrams(clean_word_data)]
print(trigrams_list)

Update

Once you get the bigram and trigram lists, you can check for matches against your keyword list to keep only the relevant ones.

keywordlist = ['coca' , 'food', 'soft', 'aerated', 'soda']

def find_matches(n_grams_list):
    matches = []
    for k in keywordlist:
        matching_list = [s for s in n_grams_list if k in s]
        [matches.append(m) for m in matching_list if m not in matches]
    return matches

all_matching_bigrams = find_matches(bigrams_list) # find all mathcing bigrams  
all_matching_trigrams = find_matches(trigrams_list) # find all mathcing trigrams

# join the two lists
all_matches = all_matching_bigrams + all_matching_trigrams
print(all_matches)

Output:

['coca_cola', 'business_soft', 'soft_drinks', 'drinks_aerated', 'aerated_water', 'coca_cola_expanding', 'expanding_business_soft', 'business_soft_drinks', 'soft_drinks_aerated', 'drinks_aerated_water']