python pandas nltk word-frequency

Text analysis: finding the most common word in a column


I have created a DataFrame that contains only the Subject column.

df = activities.filter(['Subject'],axis=1)
df.shape

This returned the following DataFrame:

    Subject
0   Call Out: Quadria Capital - May Lo, VP
1   Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
2   Columbia Partners: WW Worked (Not Sure Will Ev...
3   Meeting, Sophie, CFO, CDC Investment
4   Prospecting

I then tried to analyse the text with this code:

import nltk
import pandas as pd
top_N = 50
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)

stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords) 

rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)

The error message I get is: 'Series' object has no attribute 'Subject'


Solution

  • The error is being thrown because you have converted df to a Series in this line:

    df = activities.filter(['Subject'],axis=1)
    

    So when you say:

    txt = df.Subject.str.lower().str.replace(r'\|', ' ')
    

    df is the Series and does not have the attribute Subject. Try replacing with:

    txt = df.str.lower().str.replace(r'\|', ' ')
    

    Alternatively, don't reduce your DataFrame to a single Series beforehand, and then

    txt = df.Subject.str.lower().str.replace(r'\|', ' ')
    

    should work.

    [UPDATE]

    What I said above is incorrect. As pointed out, filter does not return a Series but a DataFrame with a single column.
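
To make the update concrete, here is a minimal sketch (not necessarily what the asker ended up using) that checks what filter returns and then counts words across the whole column. The `activities` frame below is a stand-in: the first three rows are copied from the question's output, while the last row and the `Owner` column are made up for illustration. As far as I know, `nltk.tokenize.word_tokenize` expects a single string rather than a Series, so the rows are joined into one string first. A plain regular expression stands in for the NLTK tokenizer here so the sketch runs without downloading NLTK data.

```python
import re
from collections import Counter

import pandas as pd

# Stand-in for the question's `activities` frame. The last row and the
# "Owner" column are hypothetical, added only for illustration.
activities = pd.DataFrame({
    "Subject": [
        "Call Out: Quadria Capital - May Lo, VP",
        "Meeting, Sophie, CFO, CDC Investment",
        "Prospecting",
        "Call Out: Prospecting",
    ],
    "Owner": ["a", "b", "c", "d"],
})

# filter with a list of labels keeps a one-column DataFrame, not a Series.
df = activities.filter(["Subject"], axis=1)
is_frame = isinstance(df, pd.DataFrame)

# Selecting the column from that DataFrame yields a Series.
subject = df["Subject"]

# Join all rows into one lowercase string, then split it into word tokens.
# (Assumption: a simple regex tokenizer is good enough for this sketch.)
text = " ".join(subject.dropna().str.lower())
words = re.findall(r"[a-z0-9']+", text)

# Count the tokens and put the most frequent ones into a DataFrame.
top_N = 3
word_dist = Counter(words)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=["Word", "Frequency"])
print(rslt)
```

If the NLTK tokenizer data is installed, `nltk.tokenize.word_tokenize(text)` could replace the `re.findall` line, and `nltk.FreqDist(words)` could replace `Counter(words)`. Stop words can be filtered out of `words` before counting, as in the question.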