I need to find the most frequently occurring words in the df['messages'] column of a DataFrame. It has many rows, so I lowercased them and stored them all as a single string (words joined by spaces) in one variable, all_words. But when I tried to count the most common words, it gave me the most common characters instead.
My data looks like this:
0 abc de fghi klm
1 qwe sd fd s dsdd sswd??
3 ded fsf sfsdc wfecew wcw.
Here is a snippet of my code:
from collections import Counter
all_words = ' '
for msg in df['messages'].values:
    words = str(msg).lower()
    all_words = all_words + str(words) + ' '
count = Counter(all_words)
count.most_common(3)
And here is its output:
[(' ', 5260), ('a', 2919), ('h', 1557)]
I also tried df['messages'].value_counts(), but it returns the most frequent rows (whole sentences) instead of words.
For example:
asad adas asda 10
asaa as awe 3
wedxew dqwed 1
Please tell me where I am wrong or suggest any other method that might work.
Counter iterates over whatever you pass to it. If you pass it a string, it iterates over the string character by character, so it counts characters. If you pass it a list in which each element is a word, it counts words.
from collections import Counter
text = "spam and more spam"
c = Counter()
c.update(text) # text is a str, count chars
c
# Counter({'s': 2, 'p': 2, 'a': 3, 'm': 3, [...], 'e': 1})
c = Counter()
c.update(text.split()) # text.split() is a list: ['spam', 'and', 'more', 'spam']
c
# Counter({'spam': 2, 'and': 1, 'more': 1})
So you should do something like this:
from collections import Counter
all_words = []
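for msg in df['messages'].values:
    words = str(msg).lower().split()  # split each message into a list of words
    all_words.extend(words)           # add the individual words, not the whole message
count = Counter(all_words)
count.most_common(3)
# the same, but with a generator expression
count = Counter(word for msg in df['messages'].values for word in str(msg).lower().split())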
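One more thing to watch for: the sample data in the question has punctuation attached to some words ("sswd??", "wcw."), and splitting on whitespace alone would count those as different words from "sswd" and "wcw". Here is a minimal sketch of one way around that, using re.findall to pull out the words. The messages list is only a stand-in for df['messages'].values, the last row is made up to create a repeat, and count_words is a hypothetical helper name.

```python
import re
from collections import Counter

# Stand-in for df['messages'].values. The first three rows come from the
# question; the last one is an invented row so that some words repeat.
messages = [
    "abc de fghi klm",
    "qwe sd fd s dsdd sswd??",
    "ded fsf sfsdc wfecew wcw.",
    "Abc de",
]


def count_words(msgs):
    """Count lowercased words across all messages, ignoring punctuation."""
    counts = Counter()
    for msg in msgs:
        # \w+ matches runs of letters, digits and underscores, so trailing
        # "??" or "." is not treated as part of the word.
        counts.update(re.findall(r"\w+", str(msg).lower()))
    return counts


count = count_words(messages)
print(count.most_common(3))
```

With a real DataFrame you would call count_words(df['messages'].values) instead. Note that \w+ also splits on apostrophes and hyphens, so "don't" becomes "don" and "t"; adjust the pattern if that matters for your data.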