I have created a dataframe containing only the Subject column:
df = activities.filter(['Subject'],axis=1)
df.shape
The dataframe looks like this:
Subject
0 Call Out: Quadria Capital - May Lo, VP
1 Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
2 Columbia Partners: WW Worked (Not Sure Will Ev...
3 Meeting, Sophie, CFO, CDC Investment
4 Prospecting
I then tried to analyse the text with this code:
import nltk
top_N = 50
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)
The error message I get is: 'Series' object has no attribute 'Subject'
The error is being thrown because you have converted df to a Series in this line:
df = activities.filter(['Subject'],axis=1)
So when you say:
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
df is the Series and does not have the attribute Subject. Try replacing it with:
txt = df.str.lower().str.replace(r'\|', ' ')
Alternatively, do not reduce your DataFrame to a single Series beforehand, and then
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
should work.
[UPDATE]
What I said above is incorrect: as was pointed out, filter does not return a Series but a DataFrame with a single column.
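To check this yourself, here is a minimal sketch. The `activities` frame below is a made-up stand-in for yours (the `Owner` column is purely hypothetical). It also joins the rows into one string before splitting into words, since tokenizers such as `nltk.word_tokenize` expect a single string rather than a Series. A plain regex split is used in place of NLTK so the snippet runs without downloading any corpora; it is not a drop-in replacement for NLTK's tokenizer.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the asker's `activities` frame.
activities = pd.DataFrame({
    "Subject": [
        "Call Out: Quadria Capital - May Lo, VP",
        "Meeting, Sophie, CFO, CDC Investment",
        "Prospecting",
    ],
    "Owner": ["a", "b", "c"],  # assumed extra column, for illustration only
})

df = activities.filter(["Subject"], axis=1)
is_frame = isinstance(df, pd.DataFrame)  # filter keeps the DataFrame type
subject = df["Subject"]                  # selecting the column gives a Series

# Join all rows into one lowercase string, then split into words.
# The regex keeps runs of letters only; swap in nltk.word_tokenize(text)
# if the NLTK tokenizer data is installed.
text = " ".join(subject.str.lower())
words = re.findall(r"[a-z]+", text)
word_counts = Counter(words)

top_n = 5
rslt = pd.DataFrame(word_counts.most_common(top_n), columns=["Word", "Frequency"])
print(type(df).__name__, type(subject).__name__)
print(rslt)
```

If `df.Subject` still raises the error on your side, printing `type(df)` right before that line should show whether `df` was reassigned to a Series somewhere else in the session.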