I am looking to create a co-occurrence matrix of bigrams instead of unigrams from a single string. I am referring to the following links:
http://text2vec.org/glove.html
https://tm4ss.github.io/docs/Tutorial_5_Co-occurrence.html#3_statistical_significance
I want to create the matrix and traverse it to create a dataset as follows:
Term1 Term2 Score
The biggest catch is traversing the sentence with bigrams. Any help on this would be great.
Just specify your bigrams and create the co-occurrence matrices. Below are some (really) simple examples. Choose one package and do everything with that one. Both quanteda and text2vec can use multiple cores / threads. Traversing the resulting co-occurrence matrices can be done with reshape2::melt, like this: reshape2::melt(as.matrix(my_cooccurrence_matrix)).
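As a rough sketch of the melt step, here is a small hand-built matrix turned into the Term1 / Term2 / Score layout from the question. The matrix values and the name `pairs` are made up for illustration; with a real co-occurrence matrix you would pass `as.matrix(my_cooccurrence_matrix)` instead of `m`.

```r
library(reshape2)

# Toy upper-triangular co-occurrence matrix (hypothetical values).
terms <- c("the_quick", "quick_brown", "brown_fox")
m <- matrix(c(0, 1, 1,
              0, 0, 1,
              0, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(terms, terms))

# Long format: one row per (row term, column term) cell.
pairs <- melt(m, varnames = c("Term1", "Term2"), value.name = "Score")

# Keep only the pairs that actually co-occur.
pairs <- pairs[pairs$Score > 0, ]
print(pairs)
```

Note that as.matrix() makes the matrix dense, so this is only practical while the vocabulary is small enough to fit in memory.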
txt <- c("The quick brown fox jumped over the lazy dog.",
"The dog jumped and ate the fox.")
using quanteda to create a feature co-occurrence matrix:
library(quanteda)
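# Build bigrams with tokens_ngrams(); recent quanteda versions no longer accept ngrams = in tokens()
toks <- tokens_ngrams(tokens(char_tolower(txt), remove_punct = TRUE), n = 2)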
f <- fcm(toks, context = "document")
Feature co-occurrence matrix of: 14 by 14 features.
14 x 14 sparse Matrix of class "fcm"
features
features the_quick quick_brown brown_fox fox_jumped jumped_over over_the the_lazy lazy_dog the_dog dog_jumped jumped_and and_ate
the_quick 0 1 1 1 1 1 1 1 0 0 0 0
quick_brown 0 0 1 1 1 1 1 1 0 0 0 0
brown_fox 0 0 0 1 1 1 1 1 0 0 0 0
fox_jumped 0 0 0 0 1 1 1 1 0 0 0 0
jumped_over 0 0 0 0 0 1 1 1 0 0 0 0
over_the 0 0 0 0 0 0 1 1 0 0 0 0
the_lazy 0 0 0 0 0 0 0 1 0 0 0 0
lazy_dog 0 0 0 0 0 0 0 0 0 0 0 0
the_dog 0 0 0 0 0 0 0 0 0 1 1 1
dog_jumped 0 0 0 0 0 0 0 0 0 0 1 1
jumped_and 0 0 0 0 0 0 0 0 0 0 0 1
and_ate 0 0 0 0 0 0 0 0 0 0 0 0
ate_the 0 0 0 0 0 0 0 0 0 0 0 0
the_fox 0 0 0 0 0 0 0 0 0 0 0 0
features
features ate_the the_fox
the_quick 0 0
quick_brown 0 0
brown_fox 0 0
fox_jumped 0 0
jumped_over 0 0
over_the 0 0
the_lazy 0 0
lazy_dog 0 0
the_dog 1 1
dog_jumped 1 1
jumped_and 1 1
and_ate 1 1
ate_the 0 1
the_fox 0 0
using text2vec to create a feature co-occurrence matrix:
library(text2vec)
i <- itoken(txt)
v <- create_vocabulary(i, ngram = c(2L, 2L))
vectorizer <- vocab_vectorizer(v)
f2 <- create_tcm(i, vectorizer)
14 x 14 sparse Matrix of class "dgTMatrix"
[[ suppressing 14 column names ‘the_lazy’, ‘and_ate’, ‘The_quick’ ... ]]
the_lazy . . . 0.25 1.0 . 0.2 0.3333333 . . 1.0000000 . 0.5000000 .
and_ate . . . . . 1 . . 0.5000000 1.0 . 0.3333333 . 0.5000000
The_quick . . . 0.50 . . 1.0 0.3333333 . . 0.2000000 . 0.2500000 .
brown_fox . . . . 0.2 . 1.0 1.0000000 . . 0.3333333 . 0.5000000 .
lazy_dog. . . . . . . . 0.2500000 . . 0.5000000 . 0.3333333 .
jumped_and . . . . . . . . 0.3333333 0.5 . 0.5000000 . 1.0000000
quick_brown . . . . . . . 0.5000000 . . 0.2500000 . 0.3333333 .
fox_jumped . . . . . . . . . . 0.5000000 . 1.0000000 .
the_fox. . . . . . . . . . 1.0 . 0.2000000 . 0.2500000
ate_the . . . . . . . . . . . 0.2500000 . 0.3333333
over_the . . . . . . . . . . . . 1.0000000 .
The_dog . . . . . . . . . . . . . 1.0000000
jumped_over . . . . . . . . . . . . . .
dog_jumped . . . . . . . . . . . . . .
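The text2vec output above keeps capitalisation and trailing punctuation (The_quick, lazy_dog.) because itoken() splits on whitespace and does no preprocessing by default. If you want bigrams comparable to the quanteda ones, one option is to lowercase and use a word tokenizer, roughly like this. The names `it` and `tcm` are just illustrative, and the window of 5 is an assumption you should adjust to your needs.

```r
library(text2vec)

txt <- c("The quick brown fox jumped over the lazy dog.",
         "The dog jumped and ate the fox.")

# Lowercase the text and split on word boundaries, which drops punctuation.
it <- itoken(txt, preprocessor = tolower, tokenizer = word_tokenizer,
             progressbar = FALSE)

# Vocabulary of bigrams only (min and max n-gram length both 2).
v <- create_vocabulary(it, ngram = c(2L, 2L))
vectorizer <- vocab_vectorizer(v)

# Term co-occurrence matrix over a sliding window of 5 terms.
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
dim(tcm)
```

Unlike the document-level counts from quanteda, the scores here are weighted by distance inside the window, which is why text2vec shows fractional values.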