I have a corpus that is basically a vector of short sentences (n > 50), e.g.:
corpus <- c("looking for help in R","check whether my milk is sour or not",
"random sentence with dubious meaning")
I am able to print a dendrogram:
fit <- hclust(d, method="ward.D") # "ward" was renamed to "ward.D" in R 3.1.0
plot(fit, hang=-1)
groups <- cutree(fit, k=nc) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=nc, border="red") # draw dendrogram with red borders around the nc clusters
and a correlation matrix
cor_1 <- cor(as.matrix(dtms))
corrplot(cor_1, method = "number")
As far as I understand it (please correct me if I am wrong), findAssocs(), i.e. correlation, checks whether two terms appear in the same document?
Goal: I don't want the correlation, but the number of documents in which two terms appear together, where the terms are NOT necessarily adjacent to each other (so BigramTokenizer won't work). For example: term A and term B appear together in 5 different documents across my corpus, regardless of the distance between them.
Ideally I want to create a frequency matrix similar to the one above and, if possible, add the frequencies to the dendrogram (akin to where pvclust() prints its numbers).
Any ideas on how to achieve this?
I think you are asking how to get a co-occurrence matrix for terms, where the cells are the number of documents in which a term occurs with another term. We can accomplish this magic using a matrix cross-product of the transpose of the matrix with itself, after converting the matrix of document-term frequencies to Boolean values indicating whether a term occurred in a document.
(I've used the quanteda package here instead of tm, but a similar approach will work with a DocumentTermMatrix object from tm.)
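For readers without quanteda, here is a minimal base-R sketch of the same idea on a plain count matrix. The names tokens, terms, counts, present and cooc are illustrative, not from any package, and the whitespace tokenization is a simplifying assumption. With tm, as.matrix() on the DocumentTermMatrix (as in the question's own code) should give a matrix that can stand in for counts.

```r
# Three toy documents (the same ones used in the quanteda walk-through).
txts <- c("a a b c", "a c e", "e f g")

# Assumption: naive tokenization on single spaces.
tokens <- strsplit(txts, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Document-term count matrix: one row per document, one column per term.
counts <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, terms), nbins = length(terms)),
  integer(length(terms))
))
dimnames(counts) <- list(paste0("text", seq_along(txts)), terms)

# Boolean presence/absence, stored as 0/1 so it can be multiplied.
present <- (counts > 0) * 1

# crossprod(x) computes t(x) %*% x: cell [i, j] is the number of
# documents that contain both term i and term j.
cooc <- crossprod(present)
cooc
```

The result is symmetric, and its diagonal holds the number of documents each term appears in.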
# create some demonstration documents
(txts <- c(paste(letters[c(1, 1:3)], collapse = " "),
paste(letters[c(1, 3, 5)], collapse = " "),
paste(letters[c(5, 6, 7)], collapse = " ")))
## [1] "a a b c" "a c e" "e f g"
# convert to a document-term matrix
require(quanteda)
dtm <- dfm(txts, verbose = FALSE)
dtm
## Document-feature matrix of: 3 documents, 6 features.
## 3 x 6 sparse Matrix of class "dfmSparse"
## features
## docs a b c e f g
## text1 2 1 1 0 0 0
## text2 1 0 1 1 0 0
## text3 0 0 0 1 1 1
# convert counts to Boolean indicators of whether a term occurs in a document
(dtm <- tf(dtm, "boolean"))
## Document-feature matrix of: 3 documents, 6 features.
## 3 x 6 sparse Matrix of class "dfmSparse"
## features
## docs a b c e f g
## text1 1 1 1 0 0 0
## text2 1 0 1 1 0 0
## text3 0 0 0 1 1 1
# now get the "feature in document" co-occurrence matrix
t(dtm) %*% dtm
## 6 x 6 sparse Matrix of class "dgCMatrix"
## a b c e f g
## a 2 1 2 1 . .
## b 1 1 1 . . .
## c 2 1 2 1 . .
## e 1 . 1 2 1 1
## f . . . 1 1 1
## g . . . 1 1 1
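Note: In this setup the diagonal holds the number of documents in which each term appears, so a term is counted as "co-occurring" with itself even in a document where no other copy of it is present (e.g. b has a 1 on the diagonal). If you want to change that, simply replace the diagonal with the diagonal minus one.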
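A short sketch of that diagonal adjustment, using a plain base-R matrix typed in from the co-occurrence output above (with "." written as 0). The names minus_one and off_diag are illustrative. Zeroing the diagonal is an alternative not suggested in the note; it is shown here only as an assumption about what may be wanted when only cross-term counts matter.

```r
terms <- c("a", "b", "c", "e", "f", "g")

# Co-occurrence counts copied from the printed matrix above.
cooc <- matrix(
  c(2, 1, 2, 1, 0, 0,
    1, 1, 1, 0, 0, 0,
    2, 1, 2, 1, 0, 0,
    1, 0, 1, 2, 1, 1,
    0, 0, 0, 1, 1, 1,
    0, 0, 0, 1, 1, 1),
  nrow = 6, byrow = TRUE, dimnames = list(terms, terms)
)

# Option from the note: replace the diagonal with the diagonal minus one.
minus_one <- cooc
diag(minus_one) <- diag(cooc) - 1

# Alternative (assumption): drop self co-occurrence entirely.
off_diag <- cooc
diag(off_diag) <- 0

minus_one
off_diag
```

Either way the off-diagonal cells, which are the document counts for pairs of distinct terms, are left untouched.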