
How to remove an entire line if it does not have a POS tag like CD?


I am reading a news article and POS-tagging it with nltk. I want to remove the lines that do not contain a POS tag like CD (cardinal numbers).

import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Read the whole article into one string
with open("etorg.txt") as file1:
    line = file1.read()
print(line)

# Split on whitespace and tag each word with its part of speech
words = line.split()
tokens = nltk.pos_tag(words)

How do I remove all sentences that do not contain the CD tag?


Solution

  • To drop the individual tokens tagged CD from the tagged list, use [word for word in tokens if word[1] != 'CD']

    EDIT: To get the sentences that have no numbers, use this code:

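    import nltk

    def has_number(sentence):
        # Despite its name, this returns the sentence itself when no token
        # is tagged CD, and an empty string when a number is found.
        for i in nltk.pos_tag(sentence.split()):
            if i[1] == 'CD':
                return ''
        return sentence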
    
    line = 'MNC claims 21 million sales in September. However, industry sources do not confirm this data. It is estimated that the reported sales could be in the range of fifteen to 18 million. '
    
    ''.join([has_number(x) for x in line.split('.')])
    
    > ' However, industry sources do not confirm this data '
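
The question asks for the opposite filter: keep only the sentences that do contain a CD tag and drop the rest. A minimal sketch of that, under a few assumptions: the name sentences_with_tag and its tagger parameter are hypothetical (not part of nltk), sentences are split with a naive regex on ., ! and ? rather than a proper sentence tokenizer, and nltk.pos_tag is used by default, which needs nltk and its tagger model installed.

```python
import re


def sentences_with_tag(text, tag="CD", tagger=None):
    """Return the sentences of `text` that contain at least one token tagged `tag`.

    `tagger` is a hypothetical hook: any callable that maps a list of words
    to a list of (word, tag) pairs. It defaults to nltk.pos_tag.
    """
    if tagger is None:
        # Assumption: nltk and its POS tagger model are installed.
        import nltk
        tagger = nltk.pos_tag

    # Naive sentence split: break after ., ! or ? followed by whitespace.
    # The punctuation stays attached to the sentence it ends.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    kept = []
    for sentence in sentences:
        tagged = tagger(sentence.split())
        # Keep the sentence only if some token carries the wanted tag
        if any(pos == tag for _, pos in tagged):
            kept.append(sentence)
    return kept
```

Calling " ".join(sentences_with_tag(line)) on the text read from the file should then give the article with the number-free sentences removed. Because the words are split on whitespace, a number glued to punctuation (for example "18.") may be tagged differently than expected, so tokenizing with nltk.word_tokenize first may be worth trying.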