I have a list of tokenized sentences and would like to fit a TfidfVectorizer on it. I tried the following:
from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one'], ['this', 'is', 'another']]

def identity_tokenizer(text):
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english')
tfidf.fit_transform(tokenized_list_of_sentences)
which errors out as
AttributeError: 'list' object has no attribute 'lower'
Is there a way to do this? I have a billion sentences and do not want to tokenize them again. They were already tokenized for an earlier stage.
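Try initializing the TfidfVectorizer object with the parameter lowercase=False (assuming this is actually what you want because you've already lowercased your tokens in previous stages). By default the vectorizer calls .lower() on each document before tokenizing it, which fails when the document is a list rather than a string.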
from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]

def identity_tokenizer(text):
    # The input is already tokenized, so return it unchanged.
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
tfidf.fit_transform(tokenized_list_of_sentences)
Note that I changed the sentences because the originals apparently contained only stop words, which caused another error due to an empty vocabulary.
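If you would rather bypass the string preprocessing step entirely instead of only turning off lowercasing, one possible variation is to pass the same pass-through function as both preprocessor and tokenizer. This is only a sketch: the helper name identity and the variable names are mine, and token_pattern=None is set on the assumption that you are on a scikit-learn version that otherwise warns about an unused token pattern when a custom tokenizer is supplied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(tokens):
    # Hypothetical helper: pass the pre-tokenized list through unchanged.
    return tokens

docs = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]

tfidf = TfidfVectorizer(
    preprocessor=identity,   # replaces the default preprocessing, including lowercasing
    tokenizer=identity,      # tokens are already split
    token_pattern=None,      # the regex pattern is unused with a custom tokenizer
    stop_words='english',
)
matrix = tfidf.fit_transform(docs)

# Only the non-stop-word tokens remain in the vocabulary.
vocab = sorted(tfidf.vocabulary_)
print(vocab)         # ['basketball', 'football']
print(matrix.shape)  # (2, 2)
```

Stop-word filtering still runs on the token lists here, so the empty-vocabulary caveat above applies to this version as well.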