rmetadatatopic-modelingdfm

STM: estimating metadata/topic relationships when starting from dfm


After running an STM model based on a Quanteda dfm, I want to estimate my covariates' effects on certain topics.

Running the STM model went fine, producing the topics as expected, but when using estimateEffect (in the final step in the script below) the R session is aborted, notifying there is a 'fatal error'.

How can I estimate my covariates' effects, when starting from a dfm? The STM manual advices on running an STM model from a dfm, but I couldn't find how to work with the covariates after this stage.

Here's the code:

# Read texts with Quanteda
texts <- (readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis", "source"), 
         encoding = "UTF-8-BOM"))  

mycorpus <- corpus(texts)  

tokens <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1)

mydfm <- dfm(tokens, remove = stopwords("english"), stem = TRUE)


# Run the STM model - Metadata is called with 'data = docvars(mycorpus)'
stm_from_dfm <- stm(mydfm, K = 10, prevalence =~ Date.of.Publication + source, gamma.prior='L1', data = docvars(mycorpus)) 

# Estimate effects
prep <- estimateEffect(1:10 ~ Date.of.Publication + source, stm_from_dfm, 
                       meta = docvars(mycorpus), uncertainty = "Global")

Alternatively, I made an STM corpus from my dfm corpus, using STMcorpus <- asSTMCorpus(mydfm). But then I couldn't run the STM model as it didn't recognized my meta data. Would it be better to follow this alternative strategy? (so I need to associate the meta data with the STMcorpus in some way after running STMcorpus <- asSTMCorpus(mydfm)).


Solution

  • We worked through this by email- but I'll add the answer here for others who might encounter some form of the problem.

    There is a bug in the matrixStats package which causes R to crash with large matrices on Windows only. The bug and solution are detailed here: https://github.com/HenrikBengtsson/matrixStats/issues/104. This issue contains both a simple test of the problem and instructions for how to install the development version of matrixStats which fixes it. This is an issue in version matrixStats 0.52.2 and will presumably be resolved by the next CRAN release.