python, matrix, scikit-learn, gensim, text-analysis

Computing top n word pair co-occurrences from document term matrix


I used Gensim to create a bag-of-words model. Here is the format it outputs when creating a bag-of-words document-term matrix from the tokenized texts (the real output is much longer):

id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

[[(0, 2),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 11),
  (385, 1),
  (386, 2),
  (387, 3),
  (388, 1),
  (389, 1),
  (390, 1)],
 [(4, 31),
  (8, 2),
  (13, 2),
  (16, 2),
  (17, 2),
  (26, 1),
  (28, 4),
  (29, 1),
  (30, 1)]]

This is a sparse matrix representation, and from what I understand other libraries represent the document-term matrix in a similar fashion as well. If the document-term matrix were dense (meaning the zero entries are present as well), I know that I would just have to compute A.T * A, since A has dimensions (number of documents × number of terms), so multiplying the two gives the term co-occurrences. Ultimately, I want to get the top n co-occurrences (i.e., the top n term pairs that occur together in the same texts). How would I achieve this? I am not attached to Gensim for creating the BOW model; if another library like sklearn can do it more easily, I am very open. I would appreciate any advice/help/code with this problem -- thanks!


Solution

  • Edit: Here is how you can achieve the matrix multiplication you asked about. Disclaimer: densifying the matrix this way may not be feasible for a very large corpus (a sparse variant that avoids it is sketched below the sklearn example).

    Sklearn:

    from sklearn.feature_extraction.text import CountVectorizer
    
    Doc1 = 'Wimbledon is one of the four Grand Slam tennis tournaments, the others being the Australian Open, the French Open and the US Open.'
    Doc2 = 'Since the Australian Open shifted to hardcourt in 1988, Wimbledon is the only major still played on grass'
    docs = [Doc1, Doc2]
    
    # Instantiate CountVectorizer and apply it to docs
    cv = CountVectorizer()
    doc_cv = cv.fit_transform(docs)
    
    # Display tokens (on scikit-learn < 1.0 this was cv.get_feature_names())
    cv.get_feature_names_out()
    
    # Display tokens (dict keys) and their numerical encoding (dict values)
    cv.vocabulary_
    
    # Term-term co-occurrence counts: A.T @ A on the densified document-term matrix
    token_mat = doc_cv.toarray().T @ doc_cv.toarray()
    
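    To get from token_mat to the top n pairs you asked about, one option is to drop self-co-occurrences and rank the remaining entries. Here is a minimal sketch along those lines (the variable names are my own; working on the sparse product via scipy's COO format sidesteps the memory concern above, and get_feature_names_out requires scikit-learn >= 1.0):

    import numpy as np

    # Sparse term-term co-occurrence matrix; COO exposes row/col/data arrays
    co_occ = (doc_cv.T @ doc_cv).tocoo()

    # Strictly-upper-triangle entries: drops self-pairs and counts each
    # unordered pair exactly once
    mask = co_occ.row < co_occ.col
    pairs = sorted(zip(co_occ.row[mask], co_occ.col[mask], co_occ.data[mask]),
                   key=lambda t: -t[2])

    # Map the top n pairs back to the actual terms
    n = 10
    terms = cv.get_feature_names_out()
    top_pairs = [((terms[i], terms[j]), c) for i, j, c in pairs[:n]]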

    Gensim:

    import gensim as gs
    import numpy as np
    
    cp = [[(0, 2),
      (1, 1),
      (2, 1),
      (3, 1),
      (4, 11),
      (7, 1),
      (11, 2),
      (13, 3),
      (22, 1),
      (26, 1),
      (30, 1)],
     [(4, 31),
      (8, 2),
      (13, 2),
      (16, 2),
      (17, 2),
      (26, 1),
      (28, 4),
      (29, 1),
      (30, 1)]]
    
    # Convert to dense vectors and perform the matrix multiplication;
    # size the vectors by the largest term id across the whole corpus
    num_terms = max(idx for doc in cp for idx, _ in doc) + 1
    mat_1 = gs.matutils.sparse2full(cp[0], num_terms).reshape(1, -1)
    mat_2 = gs.matutils.sparse2full(cp[1], num_terms).reshape(1, -1)
    mat = np.append(mat_1, mat_2, axis=0)
    mat_product = mat.T @ mat
    
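    With more than two documents, gensim's corpus2dense builds the whole dense matrix in one call (reusing num_terms from above), and the id2word dictionary maps matrix indices back to tokens. A rough sketch, assuming id2word is the gensim Dictionary that produced the corpus:

    # corpus2dense returns a terms x docs array, hence the transpose
    mat = gs.matutils.corpus2dense(cp, num_terms=num_terms).T
    mat_product = mat.T @ mat

    # Rank the upper-triangle entries and recover the token pairs
    np.fill_diagonal(mat_product, 0)
    rows, cols = np.triu_indices_from(mat_product, k=1)
    order = np.argsort(mat_product[rows, cols])[::-1][:10]
    top_pairs = [((id2word[int(rows[i])], id2word[int(cols[i])]),
                  mat_product[rows[i], cols[i]]) for i in order]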

    If instead you want words that appear consecutively, you can prepare a list of bigrams for a set of documents and then use Python's Counter to count the bigram occurrences. Here is an example using nltk (an sklearn alternative is sketched at the end):

    import nltk
    from nltk.util import ngrams
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords
    from collections import Counter
    
    # Requires the 'stopwords', 'wordnet', and 'inaugural' NLTK data
    # packages (e.g. nltk.download('stopwords'))
    stop_words = set(stopwords.words('english'))
    
    # Get the tokens from the built-in collection of presidential inaugural speeches
    tokens = nltk.corpus.inaugural.words()
    
    # Further text preprocessing
    # Note: the stopword check runs before lowercasing, so capitalized
    # stopwords such as 'I' survive -- which is why ('i', 'shall') shows up
    # in the bigram counts below
    tokens = [t.lower() for t in tokens if t not in stop_words]
    word_l = WordNetLemmatizer()
    tokens = [word_l.lemmatize(t) for t in tokens if t.isalpha()]
    
    # Create bigram list and count bigrams
    bi_grams = list(ngrams(tokens, 2)) 
    counter = Counter(bi_grams)
    
    # Show the most common bigrams
    counter.most_common(5)
    Out[36]: 
    [(('united', 'state'), 153),
     (('fellow', 'citizen'), 116),
     (('let', 'u'), 99),
     (('i', 'shall'), 96),
     (('american', 'people'), 40)]
    
    # Query the occurrence of a specific bigram
    counter[('great', 'people')]
    Out[37]: 7
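
    Since you are open to sklearn for this: CountVectorizer can count bigrams directly via ngram_range=(2, 2), skipping the manual bigram list. A minimal sketch reusing the docs list from the sklearn example above:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # One row per document, one column per distinct bigram
    bigram_cv = CountVectorizer(ngram_range=(2, 2))
    bigram_counts = bigram_cv.fit_transform(docs)

    # Sum the counts over all documents and rank the bigram strings
    totals = np.asarray(bigram_counts.sum(axis=0)).ravel()
    top_bigrams = sorted(zip(bigram_cv.get_feature_names_out(), totals),
                         key=lambda x: -x[1])[:5]

    Note that CountVectorizer's default tokenization (lowercasing, no stopword removal or lemmatization) differs from the nltk pipeline above, so the resulting bigrams will not match exactly.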