python, nltk, sentiment-analysis, lemmatization, part-of-speech

How to do lemmatization using NLTK or pywsd


I know that my explanation is rather long, but I found it necessary. Hopefully someone is patient and a helpful soul :) I'm doing a sentiment analysis project at the moment and I'm stuck in the pre-processing part. I imported the csv file, made it into a dataframe, and converted the variables/columns into the right data types. Then I did the tokenization like this, where I choose the variable I want to tokenize (tweet content) in the dataframe (df_tweet1):

# Tokenization
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
tokenized_sents = [tknzr.tokenize(str(i)) for i in df_tweet1['Tweet Content']]
for i in tokenized_sents:
    print(i)

The output is a list of lists of words (tokens).

Then I perform stop word removal:

# Stop word removal
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
#add words that aren't in the NLTK stopwords list
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

clean_sents = []
for m in tokenized_sents:
    stop_m = [i for i in m if str(i).lower() not in new_stopwords_list]
    clean_sents.append(stop_m)

The output is the same, but without the stop words.

The next two steps are confusing to me (part-of-speech tagging and lemmatization). I tried two things:

1) Convert the previous output into a list of strings

new_test = [' '.join(x) for x in clean_sents]

since I thought that would enable me to use this code to do both steps in one:

from pywsd.utils import lemmatize_sentence

text = new_test
lemm_text = lemmatize_sentence(text, keepWordPOS=True)

I got this error: TypeError: expected string or bytes-like object
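I'm guessing this happens because lemmatize_sentence expects a single string rather than a whole list of strings, so presumably I would have to call it once per tweet instead, something like this (not sure whether that is how the function is meant to be used):

lemm_text = [lemmatize_sentence(t, keepWordPOS=True) for t in new_test]

But even then I don't know if the result comes back in a form I can keep working with.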

2) Perform POS tagging and lemmatization separately. First POS, using clean_sents as input:

# PART-OF-SPEECH
import nltk

def process_content(clean_sents):
    try:
        tagged_list = []  
        for lst in clean_sents[:500]: 
            for item in lst:
                words = nltk.word_tokenize(item)
                tagged = nltk.pos_tag(words)
                tagged_list.append(tagged)
        return tagged_list

    except Exception as e:
        print(str(e))

output_POS_clean_sents = process_content(clean_sents)

The output is a list of lists of words with a tag attached. Then I want to lemmatize this output, but how? I tried two modules, but both gave me errors:

from pywsd.utils import lemmatize_sentence

lemmatized= [[lemmatize_sentence(output_POS_clean_sents) for word in s]
              for s in output_POS_clean_sents]

# AND

from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in s]
              for s in output_POS_clean_sents]
print(lemmatized)

The errors were respectively:

TypeError: expected string or bytes-like object

AttributeError: 'tuple' object has no attribute 'endswith'


Solution

  • If you're using a dataframe, I suggest you store the results of the pre-processing steps in a new column. This way you can always check the output, and you can always create a list of lists to use as input for a model in one line of code afterwards. Another advantage of this approach is that you can easily visualise the preprocessing pipeline and add other steps wherever you need without getting confused.

    Regarding your code, it can be optimised (for example, you could perform stop word removal and tokenisation at the same time), and there is a bit of confusion about the steps you performed. For example, you perform lemmatisation multiple times, using different libraries, and there is no point in doing that. In my opinion nltk works just fine; personally, I use other libraries to preprocess tweets only to deal with emojis, urls and hashtags, all stuff specifically related to tweets.

    # I won't write all the imports, you get them from your code
    # define new column (object dtype, since each cell will hold a list) to store the processed tweets
    df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index, dtype=object)
    
    tknzr = TweetTokenizer()
    lmtzr = WordNetLemmatizer()
    
    stop_words = set(stopwords.words("english"))
    new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
    new_stopwords_list = stop_words.union(new_stopwords)
    
    # iterate through each tweet
    for ind, row in df_tweet1.iterrows():
    
        # get initial tweet: ['This is the initial tweet']
        tweet = row['Tweet Content']
    
        # tokenisation, stopwords removal and lemmatisation all at once
        # out: ['initial', 'tweet']
        tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet) if i.lower() not in new_stopwords_list]
    
        # pos tag, no need to lemmatise again after.
        # out: [('initial', 'JJ'), ('tweet', 'NN')]
        tweet = nltk.pos_tag(tweet)
    
        # save processed tweet into the new column
        # (.at sets a single cell; .loc can complain when the value is a list)
        df_tweet1.at[ind, 'Tweet Content Clean'] = tweet
    

    So overall all you need are 4 lines: one to get the tweet string, two to preprocess the text, and one to store the processed tweet. You can add extra processing steps, paying attention to the output of each step (for example, tokenisation returns a list of strings, while pos tagging returns a list of tuples, which is why you were getting errors); see the sketch at the end of this answer if you really do want to lemmatise after tagging.

    If you want, you can then create a list of lists containing all the tweets in the dataframe:

    # out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]
    all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]
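
    If your model needs plain tokens rather than (word, tag) tuples, you can unpack them afterwards, assuming every cell of the new column holds a list of tuples like the one above:

    # keep only the tokens, dropping the pos tags
    tokens_only = [[word for word, tag in tweet] for tweet in all_tweets]

    And if you prefer your original approach 2 (tag first, lemmatise afterwards, so that for instance a verb is lemmatised as a verb instead of with the lemmatiser's default noun behaviour), you need to unpack each (word, tag) tuple and translate the Penn Treebank tag into a WordNet POS before calling the lemmatiser; passing the tuple itself is exactly what caused your 'endswith' error. A rough sketch, reusing the lmtzr defined above (penn_to_wordnet is just a small helper defined here):

    from nltk.corpus import wordnet

    def penn_to_wordnet(tag):
        # map Penn Treebank tags (JJ, VB, RB, NN, ...) to the WordNet POS constants
        if tag.startswith('J'):
            return wordnet.ADJ
        if tag.startswith('V'):
            return wordnet.VERB
        if tag.startswith('R'):
            return wordnet.ADV
        return wordnet.NOUN  # WordNetLemmatizer's default anyway

    lemmatized_tweets = [[lmtzr.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tweet]
                         for tweet in all_tweets]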