python, matrix, scikit-learn, gensim, text-analysis

Computing top n word pair co-occurrences from document term matrix


I used Gensim to create a bag-of-words model. Here is the format it outputs when creating a bag-of-words document-term matrix from the tokenized texts (the real output is much longer):

id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

[[(0, 2),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 11),
  (385, 1),
  (386, 2),
  (387, 3),
  (388, 1),
  (389, 1),
  (390, 1)],
 [(4, 31),
  (8, 2),
  (13, 2),
  (16, 2),
  (17, 2),
  (26, 1),
  (28, 4),
  (29, 1),
  (30, 1)]]

This is a sparse matrix representation, and from what I understand other libraries represent the document-term matrix in a similar fashion as well. If the document-term matrix were dense (meaning the zero entries are present as well), I know that I would just have to compute A.T * A, since A has dimensions (number of documents × number of terms), so multiplying the two gives the term co-occurrences. Ultimately, I want to get the top n co-occurrences (i.e., the top n term pairs that occur together in the same texts). How would I achieve this? I am not attached to Gensim for creating the BOW model; if another library like sklearn can do it more easily, I am very open. I would appreciate any advice/help/code with this problem -- thanks!


Solution

  • Edit: Here is how you can achieve the matrix multiplication you asked about. Disclaimer: densifying the matrix this way may not be feasible for a very large corpus (a sparse variant that avoids it is sketched below the sklearn example).

    Sklearn:

    from sklearn.feature_extraction.text import CountVectorizer
    
    Doc1 = 'Wimbledon is one of the four Grand Slam tennis tournaments, the others being the Australian Open, the French Open and the US Open.'
    Doc2 = 'Since the Australian Open shifted to hardcourt in 1988, Wimbledon is the only major still played on grass'
    docs = [Doc1, Doc2]
    
    # Instantiate CountVectorizer and apply it to docs
    cv = CountVectorizer()
    doc_cv = cv.fit_transform(docs)
    
    # Display tokens (on scikit-learn < 1.0 this was cv.get_feature_names())
    cv.get_feature_names_out()
    
    # Display tokens (dict keys) and their numerical encoding (dict values)
    cv.vocabulary_
    
    # Term-term co-occurrence counts: A.T @ A on the densified document-term matrix
    token_mat = doc_cv.toarray().T @ doc_cv.toarray()
    
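    To get from token_mat to the top n pairs you asked about, one option is to drop self-co-occurrences and rank the remaining entries. Here is a minimal sketch along those lines (the variable names are my own; working on the sparse product via scipy's COO format sidesteps the memory concern above, and get_feature_names_out requires scikit-learn >= 1.0):

    import numpy as np

    # Sparse term-term co-occurrence matrix; COO exposes row/col/data arrays
    co_occ = (doc_cv.T @ doc_cv).tocoo()

    # Strictly-upper-triangle entries: drops self-pairs and counts each
    # unordered pair exactly once
    mask = co_occ.row < co_occ.col
    pairs = sorted(zip(co_occ.row[mask], co_occ.col[mask], co_occ.data[mask]),
                   key=lambda t: -t[2])

    # Map the top n pairs back to the actual terms
    n = 10
    terms = cv.get_feature_names_out()
    top_pairs = [((terms[i], terms[j]), c) for i, j, c in pairs[:n]]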

    Gensim:

    import gensim as gs
    import numpy as np
    
    cp = [[(0, 2),
      (1, 1),
      (2, 1),
      (3, 1),
      (4, 11),
      (7, 1),
      (11, 2),
      (13, 3),
      (22, 1),
      (26, 1),
      (30, 1)],
     [(4, 31),
      (8, 2),
      (13, 2),
      (16, 2),
      (17, 2),
      (26, 1),
      (28, 4),
      (29, 1),
      (30, 1)]]
    
    # Convert to dense vectors and perform the matrix multiplication;
    # size the vectors by the largest term id across the whole corpus
    num_terms = max(idx for doc in cp for idx, _ in doc) + 1
    mat_1 = gs.matutils.sparse2full(cp[0], num_terms).reshape(1, -1)
    mat_2 = gs.matutils.sparse2full(cp[1], num_terms).reshape(1, -1)
    mat = np.append(mat_1, mat_2, axis=0)
    mat_product = mat.T @ mat
    
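    With more than two documents, gensim's corpus2dense builds the whole dense matrix in one call (reusing num_terms from above), and the id2word dictionary maps matrix indices back to tokens. A rough sketch, assuming id2word is the gensim Dictionary that produced the corpus:

    # corpus2dense returns a terms x docs array, hence the transpose
    mat = gs.matutils.corpus2dense(cp, num_terms=num_terms).T
    mat_product = mat.T @ mat

    # Rank the upper-triangle entries and recover the token pairs
    np.fill_diagonal(mat_product, 0)
    rows, cols = np.triu_indices_from(mat_product, k=1)
    order = np.argsort(mat_product[rows, cols])[::-1][:10]
    top_pairs = [((id2word[int(rows[i])], id2word[int(cols[i])]),
                  mat_product[rows[i], cols[i]]) for i in order]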

    If instead you want words that appear consecutively, you can prepare a list of bigrams for a set of documents and then use Python's Counter to count the bigram occurrences. Here is an example using nltk (an sklearn alternative is sketched at the end):

    import nltk
    from nltk.util import ngrams
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords
    from collections import Counter
    
    # Requires the 'stopwords', 'wordnet', and 'inaugural' NLTK data
    # packages (e.g. nltk.download('stopwords'))
    stop_words = set(stopwords.words('english'))
    
    # Get the tokens from the built-in collection of presidential inaugural speeches
    tokens = nltk.corpus.inaugural.words()
    
    # Further text preprocessing
    # Note: the stopword check runs before lowercasing, so capitalized
    # stopwords such as 'I' survive -- which is why ('i', 'shall') shows up
    # in the bigram counts below
    tokens = [t.lower() for t in tokens if t not in stop_words]
    word_l = WordNetLemmatizer()
    tokens = [word_l.lemmatize(t) for t in tokens if t.isalpha()]
    
    # Create bigram list and count bigrams
    bi_grams = list(ngrams(tokens, 2)) 
    counter = Counter(bi_grams)
    
    # Show the most common bigrams
    counter.most_common(5)
    Out[36]: 
    [(('united', 'state'), 153),
     (('fellow', 'citizen'), 116),
     (('let', 'u'), 99),
     (('i', 'shall'), 96),
     (('american', 'people'), 40)]
    
    # Query the occurrence of a specific bigram
    counter[('great', 'people')]
    Out[37]: 7
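
    Since you are open to sklearn for this: CountVectorizer can count bigrams directly via ngram_range=(2, 2), skipping the manual bigram list. A minimal sketch reusing the docs list from the sklearn example above:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # One row per document, one column per distinct bigram
    bigram_cv = CountVectorizer(ngram_range=(2, 2))
    bigram_counts = bigram_cv.fit_transform(docs)

    # Sum the counts over all documents and rank the bigram strings
    totals = np.asarray(bigram_counts.sum(axis=0)).ravel()
    top_bigrams = sorted(zip(bigram_cv.get_feature_names_out(), totals),
                         key=lambda x: -x[1])[:5]

    Note that CountVectorizer's default tokenization (lowercasing, no stopword removal or lemmatization) differs from the nltk pipeline above, so the resulting bigrams will not match exactly.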