r nlp sparse-matrix text-mining text2vec

Create Co-occurrence matrix with bigrams


I am looking to create a co-occurrence matrix from bigrams instead of unigrams, built from a single string. I am referring to the following links:

http://text2vec.org/glove.html

https://tm4ss.github.io/docs/Tutorial_5_Co-occurrence.html#3_statistical_significance

I want to create the matrix and then traverse it to produce a dataset with the following columns:

Term1     Term2     Score

The main difficulty is traversing the sentence in bigrams. Any help with this would be great.


Solution

  • Just specify your bigrams and create the co-occurrence matrices. Below are some (really) simple examples. Choose one package and do everything with it. Both quanteda and text2vec can use multiple cores / threads. To traverse the resulting co-occurrence matrix, convert it to long format with reshape2::melt, for example reshape2::melt(as.matrix(my_cooccurrence_matrix)).

    txt <- c("The quick brown fox jumped over the lazy dog.",
             "The dog jumped and ate the fox.")
    

    Using quanteda to create a feature co-occurrence matrix:

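    library(quanteda)
    # Older quanteda releases accepted tokens(..., ngrams = 2); newer ones
    # removed that argument, so the bigrams are built with tokens_ngrams()
    toks <- tokens(char_tolower(txt), remove_punct = TRUE)
    toks <- tokens_ngrams(toks, n = 2)
    f <- fcm(toks, context = "document")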
    
    Feature co-occurrence matrix of: 14 by 14 features.
    14 x 14 sparse Matrix of class "fcm"
                 features
    features      the_quick quick_brown brown_fox fox_jumped jumped_over over_the the_lazy lazy_dog the_dog dog_jumped jumped_and and_ate
      the_quick           0           1         1          1           1        1        1        1       0          0          0       0
      quick_brown         0           0         1          1           1        1        1        1       0          0          0       0
      brown_fox           0           0         0          1           1        1        1        1       0          0          0       0
      fox_jumped          0           0         0          0           1        1        1        1       0          0          0       0
      jumped_over         0           0         0          0           0        1        1        1       0          0          0       0
      over_the            0           0         0          0           0        0        1        1       0          0          0       0
      the_lazy            0           0         0          0           0        0        0        1       0          0          0       0
      lazy_dog            0           0         0          0           0        0        0        0       0          0          0       0
      the_dog             0           0         0          0           0        0        0        0       0          1          1       1
      dog_jumped          0           0         0          0           0        0        0        0       0          0          1       1
      jumped_and          0           0         0          0           0        0        0        0       0          0          0       1
      and_ate             0           0         0          0           0        0        0        0       0          0          0       0
      ate_the             0           0         0          0           0        0        0        0       0          0          0       0
      the_fox             0           0         0          0           0        0        0        0       0          0          0       0
                 features
    features      ate_the the_fox
      the_quick         0       0
      quick_brown       0       0
      brown_fox         0       0
      fox_jumped        0       0
      jumped_over       0       0
      over_the          0       0
      the_lazy          0       0
      lazy_dog          0       0
      the_dog           1       1
      dog_jumped        1       1
      jumped_and        1       1
      and_ate           1       1
      ate_the           0       1
      the_fox           0       0
    

    Using text2vec to create a feature co-occurrence matrix:

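    library(text2vec)
    # itoken() does not lowercase or strip punctuation by default,
    # hence terms such as The_quick and lazy_dog. in the output below
    i <- itoken(txt)
    v <- create_vocabulary(i, ngram = c(2L, 2L))  # bigrams only
    vectorizer <- vocab_vectorizer(v)
    f2 <- create_tcm(i, vectorizer)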
    
    14 x 14 sparse Matrix of class "dgTMatrix"
       [[ suppressing 14 column names ‘the_lazy’, ‘and_ate’, ‘The_quick’ ... ]]
    
    the_lazy    . . . 0.25 1.0 . 0.2 0.3333333 .         .   1.0000000 .         0.5000000 .        
    and_ate     . . . .    .   1 .   .         0.5000000 1.0 .         0.3333333 .         0.5000000
    The_quick   . . . 0.50 .   . 1.0 0.3333333 .         .   0.2000000 .         0.2500000 .        
    brown_fox   . . . .    0.2 . 1.0 1.0000000 .         .   0.3333333 .         0.5000000 .        
    lazy_dog.   . . . .    .   . .   0.2500000 .         .   0.5000000 .         0.3333333 .        
    jumped_and  . . . .    .   . .   .         0.3333333 0.5 .         0.5000000 .         1.0000000
    quick_brown . . . .    .   . .   0.5000000 .         .   0.2500000 .         0.3333333 .        
    fox_jumped  . . . .    .   . .   .         .         .   0.5000000 .         1.0000000 .        
    the_fox.    . . . .    .   . .   .         .         1.0 .         0.2000000 .         0.2500000
    ate_the     . . . .    .   . .   .         .         .   .         0.2500000 .         0.3333333
    over_the    . . . .    .   . .   .         .         .   .         .         1.0000000 .        
    The_dog     . . . .    .   . .   .         .         .   .         .         .         1.0000000
    jumped_over . . . .    .   . .   .         .         .   .         .         .         .        
    dog_jumped  . . . .    .   . .   .         .         .   .         .         .         .
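
    For the traversal step, reshape2::melt(as.matrix(...)) first builds a dense matrix, which can get large for a real vocabulary. A minimal sketch of an alternative, assuming the co-occurrence matrix is (or can be coerced to) a sparse matrix from the Matrix package: summary() lists only the stored entries, which maps directly onto the Term1 / Term2 / Score layout asked for. The toy matrix and the names `terms`, `m`, `s` and `pairs` are made up for illustration and are not output from either package.

    ```r
    library(Matrix)

    # Toy upper-triangular co-occurrence matrix for three bigrams
    # (hypothetical scores, standing in for the fcm / tcm objects above)
    terms <- c("the_quick", "quick_brown", "brown_fox")
    m <- sparseMatrix(i = c(1, 1, 2), j = c(2, 3, 3), x = c(1, 0.5, 1),
                      dims = c(3, 3), dimnames = list(terms, terms))

    # summary() on a sparse matrix returns one row per stored entry,
    # with columns i (row index), j (column index) and x (value)
    s <- summary(m)

    # Map the indices back to the bigram labels
    pairs <- data.frame(Term1 = rownames(m)[s$i],
                        Term2 = colnames(m)[s$j],
                        Score = s$x,
                        stringsAsFactors = FALSE)

    # Highest-scoring pairs first
    pairs <- pairs[order(-pairs$Score), ]
    print(pairs)
    ```

    Only non-zero pairs appear in the result, so there is no need to filter out the empty cells that melt() would otherwise return.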