python-3.x pandas nlp nltk nltk-book

Best way to understand the input text before applying ngram


Currently I am reading text from an Excel file and applying bigrams to it. finalList, used in the sample code below, holds the list of input words read from the Excel file.

I removed the stopwords from the input with the help of the following library:

from nltk.corpus import stopwords

The bigram logic is applied to the list of input words:

from nltk.util import ngrams

bigram = ngrams(finalList, 2)

Input text: I completed my end-to-end process.

Current output: Completed end, end end, end process.

Desired output: completed end-to-end, end-to-end process.

That means some groups of words, such as "end-to-end", should be treated as a single word.


Solution

  • To solve your problem, strip the unwanted punctuation with a regex that leaves hyphens alone, then split on whitespace so that end-to-end stays a single token. See this example:

     import re

     text = 'I completed my end-to-end process..:?'
     # Strip runs of '.', ':' and '?'. The hyphen is not in the character class,
     # so 'end-to-end' stays intact.
     pattern = re.compile(r"[.:?]+")
     new_text = re.sub(pattern, '', text)
     print(new_text)
     # I completed my end-to-end process

     # Now you can generate the bigrams manually.
     # 1. Tokenize the cleaned text
     tok = new_text.split()
     print(tok)  # If the token list is huge, print only the first five: print(tok[:5])
     # ['I', 'completed', 'my', 'end-to-end', 'process']

     # 2. Loop over the list, build the bigrams, and store them in a list called bigrams
     bigrams = []
     for i in range(len(tok) - 1):  # stop one short so tok[i + 1] stays in range
         bigram = tok[i] + ' ' + tok[i + 1]
         bigrams.append(bigram)

     # 3. Print your bigrams
     for bi in bigrams:
         print(bi, end=', ')
     # I completed, completed my, my end-to-end, end-to-end process,
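
If you also want the stop words removed, as in your desired output, the same idea can be sketched as below. This is only a sketch under a few assumptions: STOPWORDS is a made-up, deliberately tiny set for illustration (NLTK's stopwords.words('english') could take its place), and TOKEN_RE, tokenize, remove_stopwords and make_bigrams are hypothetical names, not part of any library.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only.
# In practice nltk.corpus.stopwords.words('english') could be used instead.
STOPWORDS = {"i", "my", "the", "a", "an", "to", "of"}

# A token is a run of letters/digits, optionally joined by single hyphens,
# so "end-to-end" matches as one token and trailing punctuation is dropped.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text):
    """Return lowercase tokens, keeping hyphenated compounds whole."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def make_bigrams(tokens):
    """Pair each token with its successor."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


text = "I completed my end-to-end process."
tokens = remove_stopwords(tokenize(text))
bigrams = make_bigrams(tokens)
print(", ".join(bigrams))
# completed end-to-end, end-to-end process
```

The stop words are filtered after tokenizing, so the "to" inside end-to-end is never looked at on its own and the compound survives as one token.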
    

    I hope this helps!