How to compute the total number of words and vocabulary of a corpus stored as a list in python? What is the major difference between these two terms?
Suppose, I am using the following list. The total number of words or the length of the list can be computed by len(L1)
. However, I am interested to know how to calculate the vocabulary of the below mentioned list.
L1 = ['newnes', 'imprint', 'elsevier', 'elsevier', 'corporate', 'drive', 'suite',
'burlington', 'usa', 'linacre', 'jordan', 'hill', 'oxford', 'uk',
'elsevier', 'inc', 'right', 'reserved', 'exception', 'newness', 'uk', 'military',
'organization', 'summary', 'task', 'definition', 'system', 'definition',
'system', 'engineering', 'military', 'project', 'military', 'project',
'definition', 'input', 'output', 'operation', 'requirement', 'development',
'overview', 'spacecraft', 'development', 'architecture', 'design']
Is this what you're looking for?
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
list_of_tokens = ['cat', 'dog','cats', 'children','dog']
unique_tokens = set(list_of_tokens)
### {'cat', 'cats', 'children', 'dog'}
tokens_lemmatized = [ lemmatizer.lemmatize(token) for token in unique_tokens]
#### ['child', 'cat', 'cat', 'dog']
unique_tokens_lemmatized = set(tokens_lemmatized)
#### {'cat', 'child', 'dog'}
print('Input tokens:',len(list_of_tokens) , 'Lemmmatized tokens:', len(unique_tokens_lemmatized)
#### Input tokens: 5 Lemmmatized tokens: 3