scikit-learn tfidfvectorizer

Use sklearn TfidfVectorizer with already tokenized inputs?


I have a list of tokenized sentences and would like to fit a TfidfVectorizer on it. I tried the following:

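from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one'], ['this', 'is', 'another']]

def identity_tokenizer(text):
    # Each document is already a list of tokens, so return it unchanged
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english')
tfidf.fit_transform(tokenized_list_of_sentences)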

which errors out as

AttributeError: 'list' object has no attribute 'lower'

Is there a way to do this? I have a billion sentences and do not want to tokenize them again; they were already tokenized for an earlier stage.


Solution

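  • Try initializing the TfidfVectorizer with lowercase=False. By default the vectorizer lowercases each document by calling .lower() on it before tokenizing, which fails when the document is a list rather than a string. This assumes skipping that step is what you want, i.e. your tokens were already lowercased in a previous stage.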

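    from sklearn.feature_extraction.text import TfidfVectorizer

    tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]

    def identity_tokenizer(text):
        # Each document is already a list of tokens, so return it unchanged
        return text

    # lowercase=False skips the .lower() call that fails on list inputs
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
    tfidf.fit_transform(tokenized_list_of_sentences)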
    

    Note that I changed the sentences: the original ones contained only English stop words, so with stop_words='english' nothing was left and fitting failed with a different error (a ValueError about an empty vocabulary).
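
A possible alternative, sketched here under assumptions rather than taken from the answer above: pass a callable as analyzer, which makes the vectorizer skip its own preprocessing and tokenization steps and use whatever the callable returns as the tokens. A callable analyzer also bypasses the stop_words parameter, so the sketch filters against scikit-learn's ENGLISH_STOP_WORDS set itself. The name pretokenized_analyzer is a hypothetical helper, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

tokenized_docs = [
    ['this', 'is', 'one', 'basketball'],
    ['this', 'is', 'a', 'football'],
]

def pretokenized_analyzer(tokens):
    # Hypothetical helper: each document is already a list of tokens,
    # so only drop English stop words and return the rest as-is.
    return [token for token in tokens if token not in ENGLISH_STOP_WORDS]

# With a callable analyzer, no lowercasing or tokenizing is attempted,
# so list inputs are accepted without setting lowercase=False.
tfidf = TfidfVectorizer(analyzer=pretokenized_analyzer)
matrix = tfidf.fit_transform(tokenized_docs)

vocabulary = sorted(tfidf.vocabulary_)
print(vocabulary)    # ['basketball', 'football']
print(matrix.shape)  # (2, 2)
```

A named function is used instead of a lambda so that the fitted vectorizer can still be pickled. If the tokens were already stop-word filtered upstream, the helper could simply return its input.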