python pandas nlp vocabulary oov

find words out of vocabulary


I have some texts in a pandas dataframe column df['mytext']. I also have a vocabulary vocab (a list of words).

I am trying to list and count the out-of-vocabulary words for each document.

I have tried the following, but it is quite slow for 10k documents.

How can I quickly and efficiently quantify the out-of-vocabulary tokens in a collection of texts in pandas?

# words in each document that are not in the vocabulary
OOV_text = df['mytext'].apply(lambda s: ' '.join([word for word in s.split() if word not in vocab]))
# fraction of tokens in each document that are out of vocabulary
OOV = df['mytext'].apply(lambda s: sum([word not in vocab for word in s.split()]) / len(s.split()))

df.shape[0] is quite large, len(vocab) is large, and len(unique words in df.mytext) << len(vocab).


Solution

  • You can use

    from collections import Counter

    vocab = ['word1', 'word2', 'word3', '2021']
    df['mytext_list'] = df['mytext'].str.split(' ')
    # build each document's Counter once, instead of once per vocabulary word
    df['count'] = df['mytext_list'].apply(
        lambda c: (lambda cnt: sum(cnt[w] for w in vocab))(Counter(c))
    )


    It should be faster than your solution because .str.split is a vectorized pandas string method and Counter tallies each document's tokens in a single pass.

    You can skip saving the helper column mytext_list to reduce memory usage.
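
  • A further speedup, since membership tests against a list cost O(len(vocab)) each: convert vocab to a set for O(1) lookups, and count each document's tokens once with Counter. A minimal sketch (the sample data and the column names oov_count / oov_rate are illustrative, not from the question):

    ```python
    from collections import Counter

    import pandas as pd

    df = pd.DataFrame({'mytext': ['word1 foo 2021', 'bar baz word2']})
    vocab = ['word1', 'word2', 'word3', '2021']
    vocab_set = set(vocab)  # O(1) membership tests instead of O(len(vocab))

    def oov_stats(text):
        counts = Counter(text.split())  # tally each token once
        total = sum(counts.values())
        oov = sum(n for w, n in counts.items() if w not in vocab_set)
        return pd.Series({'oov_count': oov,
                          'oov_rate': oov / total if total else 0.0})

    df[['oov_count', 'oov_rate']] = df['mytext'].apply(oov_stats)
    ```

    With a large vocabulary this changes the per-token cost from linear in len(vocab) to constant, which matters precisely in the case described (len(vocab) large, few unique words per document).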