My dataframe looks like this:
ID  topics  text
1   1       twitter is my favorite social media
2   1       favorite social media
3   2       rt twitter tomorrow
4   3       rt facebook today
5   3       rt twitter
6   4       vote for the best twitter
7   2       twitter tomorrow
8   4       best twitter
I want to group by topics and use CountVectorizer to compute the most frequent bigrams per topic (I really prefer CountVectorizer because it lets me remove stop words in multiple languages and set an n-gram range, e.g. 3- and 4-grams). After I get the most frequent bigram, I want to create a new column called "bigram" and assign each topic's most frequent bigram to it.
I want my output to look like this:
ID  topics  text                                 bigram
1   1       twitter is my favorite social media  favorite social
2   1       favorite social media                favorite social
3   2       rt twitter tomorrow                  twitter tomorrow
7   2       twitter tomorrow                     twitter tomorrow
4   3       rt facebook today                    rt twitter
5   3       rt twitter                           rt twitter
6   4       vote for the best twitter            best twitter
8   4       best twitter                         best twitter
Please note that the 'topics' column does NOT need to be sorted; I only ordered it by topic for the sake of visualization when writing this post.
This code will be run on 6M rows of data, so it needs to be fast.
What is the best way to do this with pandas? I apologize if it seems too complicated.
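For concreteness, this is roughly the vectorizer setup I have in mind (the multilingual stop-word list below is only a placeholder):
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder: in practice this would combine real stop-word lists
# for each language in the corpus
multilang_stop_words = ['the', 'is', 'my', 'el', 'la', 'le', 'les']

vect = CountVectorizer(analyzer='word',
                       ngram_range=(3, 4),               # 3- and 4-grams
                       stop_words=multilang_stop_words)  # custom multilingual list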
Update
You can use sklearn:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vect = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
data = vect.fit_transform(df['text'])

# Sum the bigram counts per topic, then take the most frequent one (idxmax)
bigram = (pd.DataFrame(data=data.toarray(),
                       index=df['topics'],
                       columns=vect.get_feature_names_out())
            .groupby('topics').sum().idxmax(axis=1))

df['bigram'] = df['topics'].map(bigram)
print(df)
# Output
   ID  topics                                 text            bigram
0   1       1  twitter is my favorite social media   favorite social
1   2       1                favorite social media   favorite social
2   3       2                  rt twitter tomorrow  twitter tomorrow
3   4       3                    rt facebook today    facebook today
4   5       3                           rt twitter    facebook today
5   6       4            vote for the best twitter      best twitter
6   7       2                     twitter tomorrow  twitter tomorrow
7   8       4                         best twitter      best twitter
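Note that data.toarray() densifies the whole document-term matrix, which can be prohibitively slow and memory-hungry on 6M rows. A sketch of a variant that stays sparse, assuming scipy is available: build a one-hot topic-membership matrix and multiply it into the sparse counts to get per-topic totals.
import numpy as np
from scipy.sparse import csr_matrix

# Integer topic code for every row
codes, topics = pd.factorize(df['topics'])

# One-hot membership matrix of shape (n_topics, n_docs)
onehot = csr_matrix((np.ones(len(codes)), (codes, np.arange(len(codes)))),
                    shape=(len(topics), len(codes)))

# Per-topic bigram counts without ever densifying: (n_topics, n_bigrams)
counts = onehot @ data

# Column index of the most frequent bigram per topic
best = np.asarray(counts.argmax(axis=1)).ravel()
bigram = pd.Series(vect.get_feature_names_out()[best], index=topics)
df['bigram'] = df['topics'].map(bigram)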
Update 2
How about if I want the 3 most frequent n-grams? What can I use instead of idxmax()?
# For the top 3 bigrams per topic, replace idxmax() with nlargest(3):
# sum the counts per topic, keep the 3 largest, and return them as a row
most_common3 = lambda x: x.sum().nlargest(3).index.to_frame(index=False).squeeze()

bigram = (pd.DataFrame(data=data.toarray(),
                       index=df['topics'],
                       columns=vect.get_feature_names_out())
            .groupby('topics').apply(most_common3)
            .rename(columns=lambda x: f"bigram{x+1}").reset_index())

df = df.merge(bigram, on='topics')
print(df)
# Output
   topics                                 text           bigram1       bigram2           bigram3
0       1  twitter is my favorite social media   favorite social  social media  twitter favorite
1       1                favorite social media   favorite social  social media  twitter favorite
2       2                  rt twitter tomorrow  twitter tomorrow    rt twitter      best twitter
3       2                     twitter tomorrow  twitter tomorrow    rt twitter      best twitter
4       3                    rt facebook today    facebook today   rt facebook        rt twitter
5       3                           rt twitter    facebook today   rt facebook        rt twitter
6       4            vote for the best twitter      best twitter     vote best    facebook today
7       4                         best twitter      best twitter     vote best    facebook today
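If you need an arbitrary number of n-grams, the same idea generalizes; a minimal sketch (the helper name top_k is mine, not part of any library):
def top_k(df, data, vect, k=3):
    # Per-topic bigram counts
    counts = (pd.DataFrame(data.toarray(),
                           index=df['topics'],
                           columns=vect.get_feature_names_out())
                .groupby('topics').sum())
    # The k most frequent bigrams per topic, one row per topic
    top = counts.apply(lambda row: row.nlargest(k).index.tolist(), axis=1)
    return pd.DataFrame(top.tolist(), index=top.index,
                        columns=[f"bigram{i+1}" for i in range(k)]).reset_index()

df = df.merge(top_k(df, data, vect, k=3), on='topics')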
Old answer
You can use nltk:
import nltk

# tuple (hashable) rather than list, so Series.mode() can count the sequences
to_bigram = lambda x: tuple(nltk.bigrams(x.split()))

# mode() picks the most common bigram sequence per topic (ties broken by sort
# order); the trailing [0] then takes the first bigram of that sequence
most_common = (df.set_index('topics')['text'].map(to_bigram)
                 .groupby(level=0).apply(lambda x: x.mode()[0][0]))
df['bigram'] = df['topics'].map(most_common)
print(df)
# Output
   ID  topics                                 text              bigram
0   1       1  twitter is my favorite social media  (favorite, social)
1   2       1                favorite social media  (favorite, social)
2   3       2                  rt twitter tomorrow       (rt, twitter)
3   4       3                    rt facebook today      (rt, facebook)
4   5       3                           rt twitter      (rt, facebook)
5   6       4            vote for the best twitter     (best, twitter)
6   7       2                     twitter tomorrow       (rt, twitter)
7   8       4                         best twitter     (best, twitter)
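Note that mode() above is computed over whole bigram sequences, not individual bigrams, so for topics with no repeated sequence it falls back to sort order. If you want the single most frequent bigram per topic instead, a sketch using explode() (results can differ from the output above):
# Count individual bigrams per topic rather than whole sequences
most_common = (df.set_index('topics')['text'].map(to_bigram)
                 .explode()
                 .groupby(level=0)
                 .apply(lambda x: x.mode()[0]))
df['bigram'] = df['topics'].map(most_common)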