r nlp sparse-matrix text-mining text2vec

Create Co-occurrence matrix with bigrams

I am looking to create a co-occurrence matrix with bigrams instead of unigrams from a single string. I am referring to the following links:

http://text2vec.org/glove.html

https://tm4ss.github.io/docs/Tutorial_5_Co-occurrence.html#3_statistical_significance

I want to create the matrix and traverse it to create a dataset as follows:

Term1     Term2     Score

The biggest catch is traversing the sentence with bigrams. Any help on this would be great.

Solution

Just specify your bigrams and create the co-occurrence matrix from them. Below are two (really) simple examples. Choose one package and do everything with that one. Both quanteda and text2vec can use multiple cores / threads. Traversing the resulting co-occurrence matrix can be done with reshape2::melt, like this: reshape2::melt(as.matrix(my_cooccurence_matrix)). This returns one row per pair of terms together with its value. Note that as.matrix() produces a dense matrix, which can get large for a big vocabulary.

txt <- c("The quick brown fox jumped over the lazy dog.",
         "The dog jumped and ate the fox.")

Using quanteda to create a feature co-occurrence matrix:

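library(quanteda)
# tokenize first, then form bigrams; newer quanteda releases no longer accept tokens(..., ngrams = 2)
toks <- tokens(char_tolower(txt), remove_punct = TRUE)
toks <- tokens_ngrams(toks, n = 2)
# count bigrams that occur together in the same document
f <- fcm(toks, context = "document")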

Feature co-occurrence matrix of: 14 by 14 features.
14 x 14 sparse Matrix of class "fcm"
             features
features      the_quick quick_brown brown_fox fox_jumped jumped_over over_the the_lazy lazy_dog the_dog dog_jumped jumped_and and_ate
  the_quick           0           1         1          1           1        1        1        1       0          0          0       0
  quick_brown         0           0         1          1           1        1        1        1       0          0          0       0
  brown_fox           0           0         0          1           1        1        1        1       0          0          0       0
  fox_jumped          0           0         0          0           1        1        1        1       0          0          0       0
  jumped_over         0           0         0          0           0        1        1        1       0          0          0       0
  over_the            0           0         0          0           0        0        1        1       0          0          0       0
  the_lazy            0           0         0          0           0        0        0        1       0          0          0       0
  lazy_dog            0           0         0          0           0        0        0        0       0          0          0       0
  the_dog             0           0         0          0           0        0        0        0       0          1          1       1
  dog_jumped          0           0         0          0           0        0        0        0       0          0          1       1
  jumped_and          0           0         0          0           0        0        0        0       0          0          0       1
  and_ate             0           0         0          0           0        0        0        0       0          0          0       0
  ate_the             0           0         0          0           0        0        0        0       0          0          0       0
  the_fox             0           0         0          0           0        0        0        0       0          0          0       0
             features
features      ate_the the_fox
  the_quick         0       0
  quick_brown       0       0
  brown_fox         0       0
  fox_jumped        0       0
  jumped_over       0       0
  over_the          0       0
  the_lazy          0       0
  lazy_dog          0       0
  the_dog           1       1
  dog_jumped        1       1
  jumped_and        1       1
  and_ate           1       1
  ate_the           0       1
  the_fox           0       0
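
To get from this matrix to the Term1 / Term2 / Score layout asked for in the question, the reshape2::melt step can be applied to f. The following is a minimal sketch, assuming a recent quanteda (where bigrams are built with tokens_ngrams()) and that reshape2 is installed. The column names and the variables long and pairs are my own choice, not something either package prescribes.

```r
library(quanteda)

txt <- c("The quick brown fox jumped over the lazy dog.",
         "The dog jumped and ate the fox.")

# build bigram tokens and the document-level co-occurrence matrix
toks <- tokens(char_tolower(txt), remove_punct = TRUE)
toks <- tokens_ngrams(toks, n = 2)
f <- fcm(toks, context = "document")

# dense conversion is fine for a toy example, but can be large for real vocabularies
long <- reshape2::melt(as.matrix(f),
                       varnames = c("Term1", "Term2"),
                       value.name = "Score")

# keep only the pairs that actually co-occur
pairs <- long[long$Score > 0, ]
head(pairs)
```

Each row of pairs is one bigram pair with its count. Because fcm() fills only the upper triangle by default (tri = TRUE), each pair shows up once rather than twice.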

Using text2vec to create a feature co-occurrence matrix:

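library(text2vec)
# itoken() splits on spaces by default and does not lowercase or strip punctuation
i <- itoken(txt)
# restrict the vocabulary to bigrams only
v <- create_vocabulary(i, ngram = c(2L, 2L))
vectorizer <- vocab_vectorizer(v)
# term co-occurrence matrix; by default a 5-term window weighted by 1 / distance
f2 <- create_tcm(i, vectorizer)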

14 x 14 sparse Matrix of class "dgTMatrix"
   [[ suppressing 14 column names ‘the_lazy’, ‘and_ate’, ‘The_quick’ ... ]]

the_lazy    . . . 0.25 1.0 . 0.2 0.3333333 .         .   1.0000000 .         0.5000000 .        
and_ate     . . . .    .   1 .   .         0.5000000 1.0 .         0.3333333 .         0.5000000
The_quick   . . . 0.50 .   . 1.0 0.3333333 .         .   0.2000000 .         0.2500000 .        
brown_fox   . . . .    0.2 . 1.0 1.0000000 .         .   0.3333333 .         0.5000000 .        
lazy_dog.   . . . .    .   . .   0.2500000 .         .   0.5000000 .         0.3333333 .        
jumped_and  . . . .    .   . .   .         0.3333333 0.5 .         0.5000000 .         1.0000000
quick_brown . . . .    .   . .   0.5000000 .         .   0.2500000 .         0.3333333 .        
fox_jumped  . . . .    .   . .   .         .         .   0.5000000 .         1.0000000 .        
the_fox.    . . . .    .   . .   .         .         1.0 .         0.2000000 .         0.2500000
ate_the     . . . .    .   . .   .         .         .   .         0.2500000 .         0.3333333
over_the    . . . .    .   . .   .         .         .   .         .         1.0000000 .        
The_dog     . . . .    .   . .   .         .         .   .         .         .         1.0000000
jumped_over . . . .    .   . .   .         .         .   .         .         .         .        
dog_jumped  . . . .    .   . .   .         .         .   .         .         .         .
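
The same Term1 / Term2 / Score table can be built from the text2vec result without converting to a dense matrix, by reading the non-zero entries of the sparse matrix directly. This is a minimal sketch, assuming f2 is a sparse matrix from the Matrix package (as the dgTMatrix class in the output suggests). The names cm, s and pairs are illustrative only.

```r
library(Matrix)
library(text2vec)

txt <- c("The quick brown fox jumped over the lazy dog.",
         "The dog jumped and ate the fox.")

i <- itoken(txt, progressbar = FALSE)
v <- create_vocabulary(i, ngram = c(2L, 2L))
f2 <- create_tcm(i, vocab_vectorizer(v))

# compressed column form merges any duplicated triplets before we read them
cm <- as(f2, "CsparseMatrix")
s <- summary(cm)  # one row per stored entry: row index i, column index j, value x

pairs <- data.frame(Term1 = rownames(cm)[s$i],
                    Term2 = colnames(cm)[s$j],
                    Score = s$x,
                    stringsAsFactors = FALSE)

# strongest co-occurrences first
pairs <- pairs[order(-pairs$Score), ]
head(pairs)
```

With the default create_tcm() settings the scores are distance-weighted counts rather than raw frequencies, which is why fractional values such as 0.5 and 0.25 appear.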