R :: tm - Create a table/matrix of term association frequencies and add values to dendrogram


I have a corpus that is basically a vector of short sentences (n > 50), e.g.:

corpus <- c("looking for help in R","check whether my milk is sour or not",
"random sentence with dubious meaning")

I am able to print a dendrogram:

fit <- hclust(d, method="ward.D")   # d is a distance object, e.g. from dist(); "ward" was renamed "ward.D" in R >= 3.1.0
plot(fit, hang=-1)
groups <- cutree(fit, k=nc)   # "k=" defines the number of clusters you are using
rect.hclust(fit, k=nc, border="red") # draw red borders around the nc clusters on the dendrogram

and a correlation matrix:

cor_1 <- cor(as.matrix(dtms))
corrplot(cor_1, method = "number")

As far as I understand it (please correct me if I am wrong), findAssocs(), i.e. correlation, checks whether two terms appear in the same document?

Goal: I don't want to see the correlation, but rather how often two terms appear in the same document, where the terms are NOT necessarily adjacent to each other (so a BigramTokenizer won't work). For example: term A and term B appear together in 5 different documents across my corpus, regardless of the distance between them.

Ideally I want to create a frequency matrix similar to the correlation matrix above and add the frequencies to the dendrogram if possible (akin to where pvclust() prints its numbers).

Any ideas on how to achieve this?


Solution

  • I think you are asking how to get a co-occurrence matrix for terms, where each cell is the number of documents in which one term occurs together with another term. We can get this by first converting the matrix of document-term frequencies to Boolean values indicating whether a term occurred in a document, and then taking the matrix cross-product of the transpose of that matrix with itself.

    (I've used the quanteda package here instead of tm but a similar approach will work with a DocumentTermMatrix object from tm.)

    # create some demonstration documents
    (txts <- c(paste(letters[c(1, 1:3)], collapse = " "), 
               paste(letters[c(1, 3, 5)], collapse = " "), 
               paste(letters[c(5, 6, 7)], collapse = " ")))
    ## [1] "a a b c" "a c e" "e f g"
    
    # convert to a document-term matrix
    require(quanteda)
    dtm <- dfm(txts, verbose = FALSE)
    dtm
    ## Document-feature matrix of: 3 documents, 6 features.
    ## 3 x 6 sparse Matrix of class "dfmSparse"
    ##        features
    ## docs    a b c e f g
    ##   text1 2 1 1 0 0 0
    ##   text2 1 0 1 1 0 0
    ##   text3 0 0 0 1 1 1
    
    # convert to a matrix of Boolean occurrence indicators rather than counts
    (dtm <- tf(dtm, "boolean"))
    ## Document-feature matrix of: 3 documents, 6 features.
    ## 3 x 6 sparse Matrix of class "dfmSparse"
    ##        features
    ## docs    a b c e f g
    ##   text1 1 1 1 0 0 0
    ##   text2 1 0 1 1 0 0
    ##   text3 0 0 0 1 1 1
    
    # now get the "feature in document" co-occurrence matrix
    t(dtm) %*% dtm
    ## 6 x 6 sparse Matrix of class "dgCMatrix"
    ##   a b c e f g
    ## a 2 1 2 1 . .
    ## b 1 1 1 . . .
    ## c 2 1 2 1 . .
    ## e 1 . 1 2 1 1
    ## f . . . 1 1 1
    ## g . . . 1 1 1
    

    Note: This setup also counts each term as "co-occurring" with itself, so the diagonal holds the number of documents that contain the term (e.g. 1 for b, 2 for a). If you want to change that, adjust the diagonal, for example by replacing it with the diagonal minus one or by setting it to zero.
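
    For readers without quanteda, the same cross-product idea can be sketched in base R on an ordinary numeric matrix. This is a minimal illustration, not the exact method above: the matrix m is typed in by hand to mirror the three demonstration documents, and the names m, b, cooc and cooc_no_self are made up for this example. With tm, as.matrix() on a DocumentTermMatrix should give a matrix of the same shape (documents in rows, terms in columns) that could take the place of m.

    ```r
    # toy document-term counts (rows = documents, columns = terms),
    # entered by hand to mirror the demonstration documents above
    m <- matrix(c(2, 1, 1, 0, 0, 0,
                  1, 0, 1, 1, 0, 0,
                  0, 0, 0, 1, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(docs  = paste0("text", 1:3),
                                terms = c("a", "b", "c", "e", "f", "g")))

    # Boolean incidence: 1 if the term occurs in the document at all, else 0
    b <- (m > 0) * 1

    # term-by-term co-occurrence counts; crossprod(b) computes t(b) %*% b
    cooc <- crossprod(b)

    # optional: ignore a term's co-occurrence with itself
    cooc_no_self <- cooc
    diag(cooc_no_self) <- 0

    print(cooc)
    ```

    Here cooc["a", "c"] is the number of documents that contain both a and c, regardless of how far apart they are within a document. Setting the diagonal to zero is one possible choice; keep the original diagonal if you also want each term's document frequency.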