Tags: r, sparse-matrix, quanteda, tidytext, text2vec

How to represent each word occurrence as a separate tcm vector in R?


I am looking for an efficient way to create a term co-occurrence matrix for (each) target word in a corpus, such that each occurrence of the word constitutes its own vector (row) in the tcm, where the columns are the context words (i.e., a token-based model of co-occurrence). This is in contrast with the more common approach used in vector semantics, where each term (type) gets a row and a column in a symmetric tcm, and the values are aggregated across the (co-)occurrences of the tokens of each type.

Obviously this could be done from scratch in base R, or hacked together by filtering a tcm generated by one of the existing packages, but the corpus data I'm dealing with is rather big (millions of words), and there are already nice corpus/NLP packages for R that do this sort of task efficiently and store the results in memory-friendly sparse matrices, such as text2vec (create_tcm), quanteda (fcm) and tidytext (cast_dtm). Therefore it does not seem to make sense to reinvent the wheel (in terms of iterators, hashing and whatnot). But I cannot spot a straightforward way to create a token-based tcm with any of these either; hence this question.

Minimal example:

  library(text2vec)
  library(Matrix)
  library(magrittr)

  # default approach to tcm with text2vec:
  corpus = strsplit(c("here is a short document", "here is a different short document"), " ")
  it = itoken(corpus) 
  tcm = create_vocabulary(it) %>% vocab_vectorizer() %>% create_tcm(it, ., skip_grams_window = 2, weights = rep(1, 2))

  # results in this:
  print(as.matrix(forceSymmetric(tcm, "U")))

            different here short document is a
  different         0    0     1        1  1 1
  here              0    0     0        0  2 2
  short             1    0     0        2  1 2
  document          1    0     2        0  0 1
  is                1    2     1        0  0 2
  a                 1    2     2        1  2 0

Attempt to get token-based model, for target word "short":

  i = 0
  corpus = lapply(corpus, function(x) {
    idx = which(x == "short")
    x[idx] = paste0("short", i + seq_along(idx))  # short1, short2, ...
    i <<- i + length(idx)
    x
  }) # appends a running index to each occurrence so itoken distinguishes them
  it = itoken(corpus)
  tcm = create_vocabulary(it) %>% vocab_vectorizer() %>% create_tcm(it, ., skip_grams_window = 2, weights = rep(1, 2))
  attempt = as.matrix(forceSymmetric(tcm, "U") %>%
    .[grep("^short", rownames(.)), -grep("^short", colnames(.))]
  ) # keeps the target rows and drops the target columns of the full type-based tcm

  # yields intended result but is hacky/slow:
  print(attempt)

         different here document is a
  short2         1    0        1  0 1
  short1         0    0        1  1 1

What is a better/faster alternative to this approach for deriving a token-based tcm like the one in the last example? (possibly using one of the R packages that already build type-based tcms)


Solution

  • quanteda's fcm is a very efficient way to create feature co-occurrence matrices, either at the document level or within a user-defined context. This results in a sparse, symmetric feature-by-feature matrix. But it sounds like you want each occurrence of the target word to be its own row, with its context words as the features.
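
    For contrast, here is roughly what that standard type-based fcm call looks like on the toy corpus from the question (a sketch; I pass a tokens object, since fcm() computes window co-occurrences over tokens):

    library("quanteda")
    txt <- c("here is a short document", "here is a different short document")

    # type-based fcm: one row/column per type, with counts aggregated
    # over every occurrence within a +/- 2 word window
    fcm(tokens(txt), context = "window", window = 2)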

    It looks from the example as if you want a context window of +/- 2 words, so I have done that for the target word "short".

    First, we get the context using keywords-in-context:

    library("quanteda")
    txt <- c("here is a short document", "here is a different short document")
    
    (shortkwic <- kwic(txt, "short", window = 2))
    #                                          
    # [text1, 4]        is a | short | document
    # [text2, 5] a different | short | document
    

    Then create a corpus from the context, with the keyword as a unique document name:

    shortcorp <- corpus(shortkwic, split_context = FALSE, extract_keyword = TRUE)
    docnames(shortcorp) <- make.unique(docvars(shortcorp, "keyword"))
    texts(shortcorp)
    #                 short                      short.1 
    # "is a short document" "a different short document" 
    

    Then create a dfm, selecting all words, but removing the target:

    dfm(shortcorp) %>%
      dfm_select(dfm(txt)) %>%
      dfm_remove("short")
    # Document-feature matrix of: 2 documents, 5 features (40% sparse).
    # 2 x 5 sparse Matrix of class "dfm"
    #          features
    # docs      here is a document different
    #   short      0  1 1        1         0
    #   short.1    0  0 1        1         1
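
    A caveat if you are on a recent quanteda (v3 or later; this is an assumption about your installed version): kwic() no longer accepts a raw character vector and texts() has been replaced by as.character(), so the steps above need an explicit tokens() call. Here is a minimal sketch of the same pipeline under that assumption, wrapped in a helper (token_tcm is a hypothetical name, not a quanteda function) so it generalizes to any target word:

    library("quanteda")
    library("magrittr")

    token_tcm <- function(txt, target, window = 2) {
      toks <- tokens(txt)
      k <- kwic(toks, target, window = window)  # one match per occurrence
      ctx <- corpus(k, split_context = FALSE, extract_keyword = TRUE)
      docnames(ctx) <- make.unique(docvars(ctx, "keyword"))
      dfm(tokens(ctx)) %>%
        dfm_match(featnames(dfm(toks))) %>%     # align columns with the full vocabulary
        dfm_remove(target)                      # drop the target word itself
    }

    token_tcm(c("here is a short document", "here is a different short document"), "short")

    This should reproduce the matrix above: one row per occurrence of "short", one column per remaining vocabulary feature.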