python memory tf-idf countvectorizer

Memory Issue: Creating Bigrams and Trigrams with CountVectorizer


I am trying to create a document term matrix using CountVectorizer to extract bigrams and trigrams from a corpus.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lemmatized = dat_clean['lemmatized']


c_vec = CountVectorizer(ngram_range=(2,3), lowercase = False)
ngrams = c_vec.fit_transform(lemmatized)
count_values = ngrams.toarray().sum(axis=0)
vocab = c_vec.vocabulary_
df_ngram = pd.DataFrame(sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True)
        ).rename(columns={0: 'frequency', 1:'bigram/trigram'})

I keep getting the following error:

MemoryError: Unable to allocate 7.89 TiB for an array with shape (84891, 12780210) and data type int64

While I have some experience with Python, I am pretty new to dealing with text data. I was wondering if there was a more memory efficient way to address this issue.

I'm not sure if it is helpful to know, but the ngrams object is a scipy.sparse._csr.csr_matrix.


Solution

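    Here is one way to get the final table you're looking for, with the frequency and bigram/trigram columns, without converting the entire document-term matrix to a dense array. The MemoryError comes from ngrams.toarray(), which tries to allocate every cell of the matrix, zeros included. Instead, we can sum the sparse matrix directly and build the dataframe from that result. This removes the need to allocate RAM for all of those zero entries.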
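
    For a sense of scale, the 7.89 TiB figure can be reproduced from the shape and dtype in the error message. This is only a back-of-the-envelope check, not part of the fix; the variable names are made up for illustration.

```python
# Shape and dtype taken from the MemoryError message in the question.
n_docs, n_ngrams = 84891, 12780210
bytes_per_value = 8  # int64

# A dense array stores every cell, including all the zeros.
dense_bytes = n_docs * n_ngrams * bytes_per_value
dense_tib = dense_bytes / 2**40
print(f"dense array: {dense_tib:.2f} TiB")  # dense array: 7.89 TiB

# Summing over documents leaves a single int64 per n-gram.
summed_mib = n_ngrams * bytes_per_value / 2**20
print(f"column sums: {summed_mib:.1f} MiB")  # column sums: 97.5 MiB
```

    The column sums are the only dense data the final table needs, which is why summing before converting avoids the error.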

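    # Sum each column of the sparse matrix: one row holding the total count of every term.
    # We transpose that so the terms become the index and the counts a single column.
    # get_feature_names_out() returns the terms in column order; the key order of
    # vocabulary_ is not guaranteed to match the column order.
    data = ngrams.sum(axis=0)
    keys = c_vec.get_feature_names_out()
    df_ngram = pd.DataFrame(data, columns=keys).T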
    
    # Get the count to its own column and rename all columns
    df_ngram.index.name = 'bigram/trigram'
    df_ngram.rename({0: 'count'}, inplace=True, axis=1)
    df_ngram.reset_index(inplace=True)
    
    # Calculate frequency of each term
    df_ngram['frequency'] = (df_ngram['count'] / df_ngram['count'].sum())
    df_ngram.sort_values(by=['count'], ascending=False, inplace=True)
    
    df_ngram.head()
    
    #   bigram/trigram  count   frequency
    # 1 (ngram here)    (data) (data)
    

    This could likely be simplified, but it certainly does the job.