python  python-3.x  scikit-learn  countvectorizer

CountVectorizer returning zeros


I have a vocabulary text file where each line is a word. A few words from the vocabulary are shown below:

AccountsAndTransactions_/get/v2/accounts/details_DELETE
AccountsAndTransactions_/get/v2/accounts/details_GET
AccountsAndTransactions_/get/v2/accounts/details_POST
AccountsAndTransactions_/get/v2/accounts/{accountId}/transactions_DELETE
AccountsAndTransactions_/get/v2/accounts/{accountId}/transactions_GET
AccountsAndTransactions_/get/v2/accounts/{accountId}/transactions_POST

Important: AccountsAndTransactions_/get/v2/accounts/details_DELETE is a single word in this problem.

Reading vocabulary from text file:

from pathlib import Path

with open(Path(VOCAB_FILE), "r") as f:
    vocab = f.read().splitlines()  # one vocabulary entry per line

Generating doc_paths:

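import re
from os import listdir
from os.path import isfile, join
from pathlib import Path

doc_paths = [f for f in listdir(DOC_DIR) if isfile(join(DOC_DIR, f))]
r = re.compile(r".*\.txt$")  # keep only file names ending in .txt
doc_paths = list(filter(r.match, doc_paths))
doc_paths = [Path(join(DOC_DIR, i)) for i in doc_paths]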

I then run CountVectorizer on the documents:

from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(input='filename', lowercase=False, vocabulary=vocab)
tf = tf_vectorizer.fit_transform(doc_paths)  # doc_paths is a list of pathlib.Path objects
X = tf.toarray()  # unexpectedly an all-zero matrix

The issue is that all the values in X are zero, even though the corpus documents are not empty.

Could someone help me? I want the term frequency of every word in the vocabulary for each document.


Solution

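  • I solved this problem by overriding the default analyzer of CountVectorizer. The default word analyzer tokenizes with token_pattern r"(?u)\b\w\w+\b", so it breaks each vocabulary entry apart at characters such as / and { }, and none of the resulting tokens match the vocabulary. Splitting on whitespace instead keeps each entry intact: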

    def analyzer_custom(doc):
        return doc.split()
    
    tf_vectorizer = CountVectorizer(input='filename',
                                    lowercase=False,
                                    vocabulary=vocab,
                                    analyzer=analyzer_custom)
    

    Thanks to @Chris for explaining the internal details of CountVectorizer.
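
To see why the default settings give only zeros, here is a minimal sketch that uses nothing but the standard library. It applies the regular expression that CountVectorizer documents as its default token_pattern and compares the result with a plain whitespace split. The document text is invented for illustration (it is not from my corpus), and the counting with Counter only imitates what the vectorizer does with a fixed vocabulary; it is not scikit-learn's actual implementation.

```python
import re
from collections import Counter

# Documented default token_pattern of CountVectorizer (analyzer='word').
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

vocab = [
    "AccountsAndTransactions_/get/v2/accounts/details_DELETE",
    "AccountsAndTransactions_/get/v2/accounts/details_GET",
    "AccountsAndTransactions_/get/v2/accounts/details_POST",
]

# Hypothetical document content: GET twice, POST once, DELETE never.
doc = " ".join([vocab[1], vocab[2], vocab[1]])

# Default behaviour: each entry is broken apart at "/" characters.
default_tokens = DEFAULT_TOKEN_PATTERN.findall(doc)
default_hits = [t for t in default_tokens if t in vocab]

# Custom analyzer behaviour: whitespace split keeps each entry whole.
custom_tokens = doc.split()
custom_counts = Counter(t for t in custom_tokens if t in vocab)
row = [custom_counts[w] for w in vocab]

print(default_tokens[:5])  # ['AccountsAndTransactions_', 'get', 'v2', 'accounts', 'details_GET']
print(default_hits)        # [] -> no token matches the vocabulary, hence the zeros
print(row)                 # [0, 2, 1]
```

None of the fragments produced by the default pattern is a vocabulary entry, so every count stays at zero. With the whitespace split, each token is a complete entry and the counts come out as expected.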