Tags: python, machine-learning, scikit-learn, nlp, tfidfvectorizer

Feeding my classifier one document at a time


I want my ModelBuilder class to feed a MultinomialNB() classifier (stored as self._classifier) the content of some webpages I scraped. There are many documents and they're pretty big, so I can't load the whole set into memory; I'm reading them file by file. Here's the relevant portion of code:

X = []
y = []

# loop over all files in my docs folder and for each file:
X.append(self._vectorize_text(file.read()))
y.append(category['label'])
# end of the loop

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
self._classifier.fit(X_train, y_train)
# ...

Text pre-processing and vectorization functions:

def _vectorize_text(self, text):
    preprocessed_content = preprocess_text(text)
    tfidf_vector = self._vectorizer.fit_transform([preprocessed_content])
    return tfidf_vector.toarray()

def preprocess_text(text):
    words = word_tokenize(text)
    words = [word.lower() for word in words if word.isalnum() and word.lower() not in stopwords.words('english')]
    cleaned_text = ' '.join(words)
    return cleaned_text

I get an error at self._classifier.fit when I start training my model:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (121, 1) + inhomogeneous part.

How can I resolve this?


Solution

  • You're going to have two problems doing this:

    1. Vocabulary. Suppose that in document #2, you have the word "eggs." In document #1, the word "eggs" does not appear. In order for the vector for #1 and the vector for #2 to have the same meaning, the vectorizer for document #1 needs to know to leave an empty column for the word "eggs."

      The immediate symptom you see is that the vectors end up with different lengths, and NumPy cannot represent jagged arrays (the first sketch after this list reproduces the error). If you tried to solve this by padding all arrays to the same length, you would hit a new problem: counts representing the same word would land in different columns.

      One approach to solving this is to run a CountVectorizer over your dataset, keeping the vocabulary from each document but throwing away the vectors. Then you use TF-IDF with a fixed vocabulary representing all words that appear in your dataset (the final sketch after this list puts this together).

      This answer describes how to do this.

    2. Inverse Document Frequency. TF-IDF is term frequency, within one document, multiplied by inverse document frequency for that term, within all documents.

      If you fit this one document at a time, you're essentially setting the IDF term to 1: with scikit-learn's default smoothing, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and a single-document fit makes n = df(t) = 1 for every term present, so every weight is ln(1) + 1 = 1 (the second sketch after this list checks this).

      In order to compute IDF, the vectorizer must see either all of the documents, or at least a count of how many documents contain each term in the vocabulary.

      This answer mentions a library that can deal with this problem; I haven't personally tried it.
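
To make problem 1 concrete, here is a minimal reproduction of the error in the question, using two made-up one-line documents. Each fresh fit_transform learns its own vocabulary, so the rows come out with different widths, and NumPy refuses to stack them:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

rows = []
for doc in ["green eggs and ham", "ham alone"]:
    vec = TfidfVectorizer()                        # fresh vocabulary per document
    rows.append(vec.fit_transform([doc]).toarray())

print(rows[0].shape, rows[1].shape)  # (1, 4) (1, 2) -- different widths
np.array(rows)                       # ValueError: ... inhomogeneous shape ...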
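
A quick check of the IDF claim, using one made-up document; the idf_ attribute holds the weights the vectorizer learned:

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()          # smooth_idf=True by default
vec.fit(["green eggs and ham"])  # a one-document "corpus"
print(vec.idf_)                  # [1. 1. 1. 1.] -- IDF carries no information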
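
Finally, a minimal sketch of the two-pass approach from point 1, under some assumptions: the docs folder name, the *.txt glob, and the iter_documents() helper are made up for illustration, and preprocess_text is the question's own function. Because scikit-learn's vectorizers accept any iterable of strings, the generator keeps only one file's text in memory at a time, and the single fit_transform in pass 2 streams over every document, which also gives the vectorizer the document counts it needs for IDF (point 2):

from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

DOC_DIR = "docs"  # hypothetical folder: one scraped page per file

def iter_documents():
    # Yield one preprocessed document at a time, so only a single
    # file's text is ever held in memory.
    for path in sorted(Path(DOC_DIR).glob("*.txt")):
        yield preprocess_text(path.read_text(encoding="utf-8"))

# Pass 1: learn the vocabulary document by document, keeping the
# union of terms and throwing the count vectors away.
vocabulary = set()
for doc in iter_documents():
    cv = CountVectorizer()
    cv.fit([doc])
    vocabulary.update(cv.vocabulary_)

# Pass 2: TF-IDF with a fixed vocabulary, so "eggs" always lands in
# the same column no matter which document it appears in.  fit_transform
# iterates the generator once and computes IDF across the whole corpus.
vectorizer = TfidfVectorizer(vocabulary=sorted(vocabulary))
X = vectorizer.fit_transform(iter_documents())  # sparse (n_docs, n_terms)

The resulting X is a SciPy sparse matrix, which train_test_split and MultinomialNB accept directly; skipping .toarray() is what actually preserves the memory savings.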