pandas · nlp · nltk · sentiment-analysis · treetagger

Python beginner: preprocessing a French text in Python and calculating the polarity with a lexicon


I am writing an algorithm in Python which processes a column of sentences and then gives the polarity (positive or negative) of each cell of that column. The script uses a list of negative and positive words from the NRC Emotion Lexicon (French version). I am having a problem writing the preprocess function. I have already written the count function and the polarity function, but since I have some difficulty writing the preprocess function, I am not really sure whether those functions work.

The positive and negative words were in the same file (the lexicon), but I exported the positive and negative words separately because I did not know how to use the lexicon as it was.
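
This is roughly how I split the lexicon into the two files (just a sketch: I assume here the word-level NRC layout where each line is word<TAB>emotion<TAB>flag, and the file name lexicon_fr.txt is a placeholder for my actual file):

# Split an NRC-style lexicon into positive.txt and negative.txt.
# Assumed line format: word<TAB>emotion<TAB>flag, e.g. "heureux\tpositive\t1"
positive_words, negative_words = [], []
with open('lexicon_fr.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) != 3:
            continue
        word, emotion, flag = parts
        if flag == '1' and emotion == 'positive':
            positive_words.append(word)
        elif flag == '1' and emotion == 'negative':
            negative_words.append(word)

with open('positive.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(positive_words))
with open('negative.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(negative_words))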

My function that counts occurrences of positive and negative words does not work, and I do not know why it always returns 0. I added positive words to each sentence, so they should appear in the dataframe:

Output:


   id                                           Verbatim      ...       word_positive  word_negative
0  15  Je n'ai pas bien compris si c'était destiné a ...      ...                   0              0
1  44  Moi aérien affable affaire agent de conservati...      ...                   0              0
2  45  Je affectueux affirmative te hais et la Foret ...      ...                   0              0
3  47  Je absurde accidentel accusateur accuser affli...      ...                   0              0

[4 rows x 6 columns]

Here is the counting function and how I apply it to the column:
def count_occurences_Pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)

    intersection = [w for w in text_list if w in word_list]


    return len(intersection)
csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_Pos, args=(lexiconPos, ))
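
As I understand it, Series.apply passes each cell of the column as the first argument and forwards args as extra positional arguments, so the call above amounts to count_occurences_Pos(cell, lexiconPos) for each row. A small standalone example of that pattern (contains_word is just a toy helper):

import pandas as pd

def contains_word(text, word):
    # naive check: is the word a substring of the text?
    return word in text

s = pd.Series(["la foret est belle", "je suis perdue"])
print(s.apply(contains_word, args=("belle",)))
# 0     True
# 1    False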

This is my csv data: the rows with id 44 and 45 contain positive words and the row with id 47 contains more negative words, but the positive and negative word columns are always empty; the function does not return the number of words, and the final polarity column is always positive even though the last sentence is negative.

id;Verbatim
15;Je n'ai pas bien compris si c'était destiné a rester
44;Moi aérien affable affaire agent de conservation qui ne agraffe connais rien, je trouve que c'est s'emmerder pour rien, il suffit de mettre une multiprise
45;Je affectueux affirmative te hais et la Foret enchantée est belle de milles faux et les jeunes filles sont assises au bor de la mer
47;Je absurde accidentel accusateur accuser affliger affreux agressif allonger allusionne admirateur admissible adolescent agent de police Comprends pas la vie et je suis perdue 

Here is the full code:

# -*- coding: UTF-8 -*-
import codecs 
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except:
    print("Import TreeTagger pas Ok")

from itertools import islice
from collections import defaultdict, Counter



csv_df = pd.read_csv('test.csv', na_values=['no info', '.'], encoding='Cp1252', delimiter=';')
#print(csv_df.head())

stopWords = set(stopwords.words('french'))  
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')     
def process_text(text):
    '''extract lemma and lowerize then removing stopwords.'''

    text_preprocess =[]
    text_without_stopwords= []

    text = tagger.tag_text(text)
    for word in text:
        parts = word.split('\t')
        try:
            if parts[2] == '':
                text_preprocess.append(parts[1])
            else:
                text_preprocess.append(parts[2])
        except:
            print(parts)


    text_without_stopwords= [word.lower() for word in text_preprocess if word.isalnum() if word not in stopWords]
    return text_without_stopwords

csv_df['sentence_processing'] = csv_df['Verbatim'].apply(process_text)
#print(csv_df['word_count'].describe())
print(csv_df)


lexiconpos = open('positive.txt', 'r', encoding='utf-8')
print(lexiconpos.read())
def count_occurences_pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''

    text_list = process_text(text)

    intersection = [w for w in text_list if w in word_list]

    return len(intersection)


#csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_pos, args=(lexiconpos, ))
#print(csv_df)

lexiconneg = open('negative.txt', 'r', encoding='utf-8')

def count_occurences_neg(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)

    intersection = [w for w in text_list if w in word_list]

    return len(intersection)
#csv_df['word_negative'] = csv_df['Verbatim'].apply(count_occurences_neg, args= (lexiconneg, ))
#print(csv_df)

def polarity_score(text):   
    ''' give the polarity of each text based on the number of positive and negative word '''
    positives_text =count_occurences_pos(text, lexiconpos)
    negatives_text =count_occurences_neg(text, lexiconneg)
    if positives_text > negatives_text :
        return "positive"
    else : 
        return "negative"
csv_df['polarity'] = csv_df['Verbatim'].apply(polarity_score)
#print(csv_df)
print(csv_df)

If you could also check whether the rest of the code is correct, thank you.


Solution

  • I have found your error! It comes from the Polarity_score function.

    It's just a typo: in your if statement you were comparing count_occurences_Pos and count_occurences_Neg, which are functions, instead of comparing the results of those functions, count_text_pos and count_text_neg.

    Your code should be like this:

    def Polarity_score(text):
        ''' give the polarity of each text based on the number of positive and negative words '''
        count_text_pos = count_occurences_Pos(text, word_list)
        count_text_neg = count_occurences_Neg(text, word_list)
        if count_text_pos > count_text_neg:
            return "Positive"
        else:
            return "negative"
    

    In the future, you should learn to use meaningful names for your variables to avoid this kind of error. With correct variable names, your function should be:

    def polarity_score(text):
        ''' give the polarity of each text based on the number of positive and negative word '''
        positives_text = count_occurences_pos(text, word_list)
        negatives_text = count_occurences_neg(text, word_list)
        if positives_text > negatives_text:
            return "Positive"
        else:
            return "negative"
    

    Another improvement you can make in your count_occurences_pos and count_occurences_neg functions is to use a set instead of a list. Your text and your word_list can both be converted to sets, and you can use the set intersection to retrieve the positive words that occur in the text, because membership tests on sets are faster than on lists.
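
    For example, something like this (a sketch: it assumes positive.txt contains one word per line, and note that a set intersection counts each distinct matching word only once, whereas your list comprehension counts duplicates):

    # Load the positive lexicon once into a set for fast membership tests.
    with open('positive.txt', encoding='utf-8') as f:
        lexiconpos = {line.strip().lower() for line in f if line.strip()}

    def count_occurences_pos(text, word_set):
        '''Count the distinct words of the text that appear in the word set.'''
        return len(set(process_text(text)) & word_set)

    csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_pos, args=(lexiconpos,))

    The same applies to the negative lexicon and count_occurences_neg.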