I have some texts in a pandas DataFrame column df['mytext'].
I also have a vocabulary vocab (a list of words).
I am trying to list and count the out-of-vocabulary words for each document.
I have tried the following, but it is quite slow for 10k documents.
How can I quickly and efficiently quantify the out-of-vocabulary tokens in a collection of texts in pandas?
# Words of each document that are not in the vocabulary, space-joined
OOV_text = df['mytext'].apply(lambda s: ' '.join(word for word in s.split() if word not in vocab))
# Fraction of each document's tokens that are out of vocabulary
OOV = df['mytext'].apply(lambda s: sum(word not in vocab for word in s.split()) / len(s.split()))
Constraints: df.shape[0] is quite large, len(vocab) is large, and len(unique words in df.mytext) << len(vocab).
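For reference, most of the cost in the code above is that word not in vocab rescans the whole list for every token; converting vocab to a set makes each check O(1). A minimal sketch of the same two computations under that change (vocab_set and oov_ratio are names introduced here for illustration):

vocab_set = set(vocab)  # one-time conversion; each membership test is then O(1)

OOV_text = df['mytext'].apply(lambda s: ' '.join(w for w in s.split() if w not in vocab_set))

def oov_ratio(s):
    tokens = s.split()
    # guard empty documents to avoid a ZeroDivisionError
    return sum(w not in vocab_set for w in tokens) / len(tokens) if tokens else 0.0

OOV = df['mytext'].apply(oov_ratio)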
You can use collections.Counter:
from collections import Counter

vocab = ['word1', 'word2', 'word3', '2021']
df['mytext_list'] = df['mytext'].str.split(' ')

def count_in_vocab(tokens):
    counts = Counter(tokens)  # tally the document's tokens once, not once per vocabulary word
    return sum(counts[w] for w in vocab)

df['count'] = df['mytext_list'].apply(count_in_vocab)
It should be faster than your solution because Counter tallies each document's tokens in a single pass and its lookups are O(1), instead of scanning the vocab list for every token.
You can skip saving the helper column 'mytext_list' to reduce memory usage.
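A minimal sketch of that variant, also flipped to the out-of-vocabulary count the question asks about (assuming whitespace tokenization; count_oov and vocab_set are names introduced here for illustration):

from collections import Counter

vocab_set = set(vocab)  # set membership is O(1), unlike scanning a list

def count_oov(text):
    counts = Counter(text.split())  # tokenize and tally in one pass, no helper column
    # total occurrences of tokens that do not appear in the vocabulary
    return sum(n for token, n in counts.items() if token not in vocab_set)

df['oov_count'] = df['mytext'].apply(count_oov)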